Twitter extractor sporadically misses visits from articles contained in Retweets
I have set up the Twitter source from the archive extractor (but without twint, and there lies a bug, which I will report later), and it works fine on article pages contained in my own tweets, but I see no “context” for articles contained in tweets that I have merely retweeted (i.e. the page URL is contained in the retweet).
I’m pasting some retweets of mine that failed to appear as visits when visiting the embedded articles (I know you cannot evaluate them, but they are for me to test in the future):
- https://twitter.com/bechhof/status/1358844841208160256?s=20 (pasted site)
- https://twitter.com/tparsi/status/1358644141610180612?s=20 (pasted site)
- https://twitter.com/blacktom1961/status/1358123469729386496?s=20 (pasted site)
- https://twitter.com/vouliwatch/status/1337358621298978822?s=20 (pasted site also from Hypothesis context)
While this one DOES appear as a visit in the context of the pasted-site:
- https://twitter.com/unherd/status/1358694617693249537?s=20 (pasted site which also happens to have been Hypothesis-annotated by me, but this is probably irrelevant since my last sample (5) above is also in my Hypothesis visits)
Questions
a. Is this by the spec of the extractor? But then why did it pick up a visit from the retweet above?
b. Is it due to some site’s peculiarities (e.g. CORS) or to the retweet itself?
c. Can I send you something more to debug it?
d. Can you tell me where in the code I can add this feature (if it is by the spec)?
Note that I went further back in time in case today’s Twitter archive has missed some of my recent tweets, and I confirm that my own tweets always appear as visits in the pasted sites, e.g.:
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
Yep, modifying the originals sounds fragile. It could be OK if there were only a couple of bad entries, but if there are hundreds you probably want to automate it anyway. And at this point it’s easier to use the same code to synthesize the correct view at runtime, which brings us to the next option:
Yep! It doesn’t really have to be a whole separate project to start with. To start with I’d just recommend an editable install & modifying the HPI code directly. You’d add something like `my/twitter/missing_retweets.py` which would give you this
For `Tweet` here you can probably just reuse the same class as in `my/twitter/archive.py`… ideally it would be something in `my/twitter/common.py` perhaps, but for now it’s ok.
After that there are multiple options:
- `my.twitter.archive.tweets` and fix up the data in it, e.g. by going simultaneously through its original data and your new `missing_retweets` data (and using the `_replace()` method, or constructing a new correct `Tweet` to emit).
- `my.twitter.all` in a similar manner if you want to fix up twint data as well
Let me know if it makes sense, or if you want extra clarification! After you’ve done that, promnesia should pick up new data automatically, because it just uses `my.twitter.all.tweets`: https://github.com/karlicoss/promnesia/blob/master/src/promnesia/sources/twitter.py#L17
Once you do that and you’re happy with the way it works you can decouple it from the original HPI module. I need to document this, there is some info on this here: https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#addingmodifying-modules
Or there is an example here: https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/calendar/holidays.py#L1-L14 ; happy to explain it in more detail too
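
The fix-up idea from this comment (walking the original tweets together with a separate set of corrections and patching entries via `_replace()`) could look roughly like the sketch below. Everything here is an assumption for illustration: `Tweet` is a hypothetical stand-in rather than the real class from `my/twitter/archive.py`, and `fixed_tweets`, the field names, and the sample data are made up.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Tweet(NamedTuple):
    # Hypothetical stand-in; the real class lives in my/twitter/archive.py
    # and may have different fields.
    id_str: str
    text: str
    urls: List[str]


def fixed_tweets(original: Iterable[Tweet], corrections: Dict[str, Tweet]) -> Iterator[Tweet]:
    """Yield every original tweet, patched when a correction with the same id exists."""
    for tweet in original:
        fix = corrections.get(tweet.id_str)
        if fix is None:
            yield tweet
        else:
            # NamedTuple._replace returns a new tuple and leaves the original untouched
            yield tweet._replace(text=fix.text, urls=fix.urls)


# Made-up sample data: the archive copy of the retweet is cut off and has no URLs.
archive = [
    Tweet("1", "my own tweet https://example.com/a", ["https://example.com/a"]),
    Tweet("2", "RT @someone: a long retweet that got cut off…", []),
]
corrections = {
    "2": Tweet(
        "2",
        "RT @someone: a long retweet that got cut off here https://example.com/b",
        ["https://example.com/b"],
    ),
}

result = list(fixed_tweets(archive, corrections))
```

Because the originals are never modified, the corrections can live in their own file and be regenerated or extended without touching the exported archive.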
Hi, thanks for raising the issue! Very good catch, it’s usually hard to notice, so I appreciate you sharing.
I just checked my archive, and can confirm similar stuff. For example, I retweeted this tweet: https://twitter.com/lexfridman/status/1268306344924176384 And this is how it looks in the archive:
Interesting that, for example:
- `"retweeted" : false`, even though it’s obviously a retweet (and in fact it’s `false` for all tweets in my archive!)
- `"truncated": false`, even though it’s obviously truncated (again, for none of the tweets in my archive is it set to `true`)
Some tweets seem to have `urls` properly set… some don’t, and I can’t really spot a pattern. Typical 😦
If you’re curious, this is the bit of code which handles URLs for Twitter in Promnesia: https://github.com/karlicoss/promnesia/blob/e3b21cb080fa9965802bfd2e931ef4263e3a34e9/src/promnesia/sources/twitter.py#L22-L33, so it tries to use the `urls` if they are present, but if they are empty it also tries to extract them from the body (I’ve had precedents where URLs are there but not set in the field 🤷 ). But it can’t do anything if the text is also truncated, sadly.
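
As a rough illustration of the fallback described above (prefer the `urls` field, otherwise scan the body), here is a minimal sketch. It is not the Promnesia code linked above: the function name `tweet_urls`, the simplified dict layout with `urls` and `text` keys, and the regex are all assumptions made for the example.

```python
import re
from typing import Any, Dict, List

# Deliberately simple pattern, good enough for a sketch
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def tweet_urls(tweet: Dict[str, Any]) -> List[str]:
    """Return the URLs for a tweet given as a dict (hypothetical layout)."""
    urls = [u for u in tweet.get("urls", []) if u]
    if urls:
        # The dedicated field is populated, so trust it
        return urls
    found = []
    for raw in URL_RE.findall(tweet.get("text", "")):
        if raw.endswith("…"):
            # The link itself was cut off by truncation; nothing to recover
            continue
        # Drop trailing punctuation that is unlikely to belong to the URL
        found.append(raw.rstrip(".,;:!?)"))
    return found


with_field = tweet_urls({"urls": ["https://example.com/a"], "text": "see https://t.co/x"})
from_body = tweet_urls({"urls": [], "text": "read https://example.com/b, please"})
truncated = tweet_urls({"urls": [], "text": "RT @someone: long text https://exam…"})
```

The last call shows the limitation mentioned above: when the body is truncated in the middle of the link, neither source yields a usable URL.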
As for twint – unfortunately, it seems sort of unmaintained at the moment and hasn’t really worked for me lately at all.