Twitter extractor sporadically misses visits from articles contained in Retweets

I have set up the Twitter source using the archive extractor (without twint; there is a bug there, which I will report later). It works fine for article pages linked from my own tweets, but I see no “context” for articles linked from tweets that I have only retweeted (i.e. the page URL is contained in the retweet).

I’m pasting some retweets of mine that failed to appear as visits when visiting the embedded articles (I know you cannot evaluate them, but this lets me test them in the future):

  1. https://twitter.com/bechhof/status/1358844841208160256?s=20 (pasted site)
  2. https://twitter.com/tparsi/status/1358644141610180612?s=20 (pasted site)
  3. https://twitter.com/blacktom1961/status/1358123469729386496?s=20 (pasted site)
  4. https://twitter.com/vouliwatch/status/1337358621298978822?s=20 (pasted site also from Hypothesis context)

While this one DOES appear as a visit in the context of the pasted-site:

  1. https://twitter.com/unherd/status/1358694617693249537?s=20 (pasted site, which also happens to have been Hypothesis-annotated by me, but this is probably irrelevant since my last sample (5) above is also in my Hypothesis visits)

Questions

  a. Is this by the spec of the extractor? But then why did it pick up a visit from the retweet above?
  b. Is it due to some peculiarity of the site (e.g. CORS) or of the retweet?
  c. Can I send you something more to debug it?
  d. Can you tell me where in the code I can add this feature (if this is by the spec)?


Note that I went further back in time in case today’s Twitter archive has missed some of my recent tweets, and I confirm that my own tweets always appear as visits in the pasted sites, e.g.:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

karlicoss commented, Feb 14, 2021

“I don’t like touching the originals.”

Yep, modifying the originals sounds fragile. Could be OK if there were only a couple of bad entries, but if there are hundreds you probably want to automate it anyway. And at this point it’s easier to use the same code to synthesize the correct view at runtime, which brings us to the next option:

“Possibly produced by a new standalone module”

Yep! It doesn’t really have to be a whole separate project to start with. I’d just recommend an editable install and modifying the HPI code directly. You’d add something like my/twitter/missing_retweets.py, which would give you this:

from typing import Iterable

def missing_retweets() -> Iterable[Tweet]:
    ...

For Tweet here you can probably just reuse the same class as in my/twitter/archive.py … ideally it would be something in my/twitter/common.py perhaps, but for now it’s OK.

After that there are multiple options:

  • you can modify my.twitter.archive.tweets and fix up the data in it, e.g. by going simultaneously through its original data and your new missing_retweets data (and using the _replace() method, or constructing a new correct Tweet to emit).
  • alternatively, you can modify my.twitter.all in a similar manner if you want to fix up twint data as well
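
The first option could look roughly like the sketch below. It is only an illustration, not HPI's actual code: the Tweet tuple is a simplified stand-in (the real class in my/twitter/archive.py has different fields), and the fixup helper and the matching by id_str are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Tweet(NamedTuple):
    # Simplified stand-in for the real Tweet class; field names are assumptions.
    id_str: str
    text: str
    urls: Tuple[str, ...] = ()


def fixup(original: Iterable[Tweet], missing: Iterable[Tweet]) -> Iterator[Tweet]:
    """Yield the original tweets, patching those that have a corrected version."""
    corrected: Dict[str, Tweet] = {t.id_str: t for t in missing}
    for tweet in original:
        fix = corrected.get(tweet.id_str)
        if fix is None:
            yield tweet
        else:
            # _replace() builds a new tuple, so the archive data stays untouched
            yield tweet._replace(text=fix.text, urls=fix.urls)


archive = [
    Tweet("1", "my own tweet https://example.com/a", ("https://example.com/a",)),
    Tweet("2", "RT @someone: a truncated retweet…"),
]
patches = [
    Tweet("2", "RT @someone: a retweet with a link https://example.com/b",
          ("https://example.com/b",)),
]
fixed = list(fixup(archive, patches))
```

Because the patched tweets are emitted from the same stream as the untouched ones, anything that consumes the tweets downstream would not need to know that a fix-up step exists.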

Let me know if it makes sense, or you want extra clarification! After you’ve done that, promnesia should pick up new data automatically, because it just uses my.twitter.all.tweets: https://github.com/karlicoss/promnesia/blob/master/src/promnesia/sources/twitter.py#L17

Once you do that and you’re happy with the way it works you can decouple it from the original HPI module. I need to document this, there is some info on this here: https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#addingmodifying-modules

Or there is an example here: https://github.com/karlicoss/hpi-personal-overlay/blob/master/src/my/calendar/holidays.py#L1-L14 ; happy to explain it in more detail too

1reaction
karlicosscommented, Feb 13, 2021

Hi, thanks for raising the issue! Very good catch, it’s usually hard to notice, so appreciate you sharing.

I just checked my archive, and can confirm similar stuff. For example, I retweeted this tweet: https://twitter.com/lexfridman/status/1268306344924176384. And this is how it looks in the archive:

}, {
  "tweet" : {
    "retweeted" : false,
    "source" : "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>",
    "entities" : {
      "hashtags" : [ ],
      "symbols" : [ ],
      "user_mentions" : [ {
        "name" : "Lex Fridman",
        "screen_name" : "lexfridman",
        "indices" : [ "3", "14" ],
        "id_str" : "427089628",
        "id" : "427089628"
      } ],
      "urls" : [ ]
    },
    "display_text_range" : [ "0", "140" ],
    "favorite_count" : "0",
    "id_str" : "1269692286906007552",
    "truncated" : false,
    "retweet_count" : "0",
    "id" : "1269692286906007552",
    "created_at" : "Sun Jun 07 18:06:45 +0000 2020",
    "favorited" : false,
    "full_text" : "RT @lexfridman: Here's the 100th episode of the AI podcast: a conversation with my dad, Alexander Fridman, one of the top plasma physicists…",
    "lang" : "en"
  }
}, {

Interesting that, for example

  • "retweeted" : false, even though it’s obviously a retweet (and in fact it’s false for all tweets in my archive!)
  • "truncated": false, even though it’s obviously truncated (again, for none of the tweets in my archive it’s set to true)
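
Since neither flag can be trusted, one possible workaround is to infer both properties from full_text. The sketch below is a heuristic under stated assumptions, not something the archive format or Promnesia guarantees: the function names are hypothetical, and it assumes that retweets start with "RT @name: " and that truncated text ends with the "…" character, as in the entry above.

```python
import re

# "RT @screen_name: " prefix, as seen in the archive entry above
RT_PREFIX = re.compile(r"^RT @(\w+): ")


def looks_like_retweet(full_text: str) -> bool:
    """Guess whether a tweet is a retweet from its text alone."""
    return RT_PREFIX.match(full_text) is not None


def looks_truncated(full_text: str) -> bool:
    """Guess whether the text was cut off (it ends with a horizontal ellipsis)."""
    return full_text.rstrip().endswith("…")


sample = (
    "RT @lexfridman: Here's the 100th episode of the AI podcast: a conversation "
    "with my dad, Alexander Fridman, one of the top plasma physicists…"
)
is_retweet = looks_like_retweet(sample)
is_truncated = looks_truncated(sample)
```

A tweet that genuinely ends with "…" would be misclassified as truncated, so this is better suited to flagging candidates for review than to filtering.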

Some tweets seem to have urls properly set… some don’t, can’t really spot a pattern. Typical 😦

If you’re curious, this is the bit of code which handles URLs for Twitter in Promnesia: https://github.com/karlicoss/promnesia/blob/e3b21cb080fa9965802bfd2e931ef4263e3a34e9/src/promnesia/sources/twitter.py#L22-L33. It tries to use the urls if they are present, but if they are empty it also tries to extract them from the body (I’ve had precedents where URLs are there but not set in the field 🤷 ). But it can’t do anything if the text is also truncated, sadly.
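
The fallback described above can be illustrated with a small sketch. This is not the Promnesia code from the link; the tweet_urls function and the URL regex are assumptions chosen for the example.

```python
import re
from typing import List

# Deliberately simple URL pattern; stops at whitespace or a truncation ellipsis
URL_RE = re.compile(r"https?://[^\s…]+")


def tweet_urls(entity_urls: List[str], text: str) -> List[str]:
    """Prefer the URLs from the 'urls' field, fall back to scanning the body."""
    if entity_urls:
        return list(entity_urls)
    return URL_RE.findall(text)


from_field = tweet_urls(["https://example.org/post"], "text https://t.co/abc")
from_body = tweet_urls([], "worth reading https://example.com/a today")
from_truncated = tweet_urls([], "RT @someone: a long retweet that got cut off…")
```

The last call returns an empty list: once the link has been cut out of the text, there is nothing left to recover from the archive entry alone.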

As for twint – unfortunately, it seems sort of unmaintained at the moment and hasn’t really worked for me lately at all.
