flatten and line oriented tweets

The twarc2 flatten command moves any expansions/fields inline into each tweet and also makes the output file line-oriented JSON, where each line is a tweet.
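As a rough illustration of why line-oriented output is convenient, the sketch below reads such data one tweet at a time. The function name `read_tweets` and the sample records are invented for the example; only the one-tweet-per-line layout comes from the description above.

```python
import io
import json


def read_tweets(lines):
    """Yield one tweet dict per non-empty line of line-oriented JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for an open file of flattened output (sample data is invented).
sample = io.StringIO(
    '{"id": "1", "text": "first"}\n'
    '{"id": "2", "text": "second"}\n'
)
ids = [tweet["id"] for tweet in read_tweets(sample)]
```

Because every line is a complete tweet, the reader needs no knowledge of response envelopes or expansions.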

It would be useful for plugin development if twarc.expansions contained a function (e.g. jsonl) that returned a list in which every element is a tweet dictionary, regardless of whether the data key of the object it was passed holds a list or a single object.

I find myself repeating the same conditional logic in places where I want to process a file of tweets.
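The repeated conditional is presumably something like the following. This is only a sketch of the kind of helper being asked for, not an existing twarc function; the name `tweets_in` is hypothetical.

```python
def tweets_in(response):
    """Return a list of tweet dicts from one response object.

    Hypothetical helper: `data` may be a list of tweets, a single tweet
    dict, or missing entirely.
    """
    data = response.get("data", [])
    if isinstance(data, dict):
        return [data]
    return list(data)


batch = tweets_in({"data": [{"id": "1"}, {"id": "2"}]})
single = tweets_in({"data": {"id": "3"}})
```

With a helper like this, code that processes a file of tweets can loop over the result without checking the type of `data` each time.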

Does that make sense? Or maybe flatten() should just do it?

Issue Analytics

State:
Created 2 years ago
Comments:7 (7 by maintainers)

Top GitHub Comments

2 reactions

edsu commented, May 28, 2021

Yes @SamHames I think I’m finally coming around to your idea that twarc2 should always write whatever the API gives it. Maybe I needed the experience of using --flatten in the wild to recognize what you suspected early on.

Part of the difficulty here is the semantics of "flattened". twarc.expansions.flatten moves the includes in a response inline to the objects that reference them. It does not ensure that response['data'] is a list, because the streaming API returns response['data'] as a dictionary object. This inconsistency was introduced by the Twitter API and not by us, and it means that any code that processes the data needs to peek at the type of response['data'] to know what to do. It’s not a big deal but it’s bothersome.
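To make "moving includes inline" concrete, here is a deliberately small toy version that only handles authors. It is not twarc's implementation; the function name `inline_authors` is made up, and the sample response is invented. It also shows the type check on `data` that the comment describes.

```python
def inline_authors(response):
    """Toy illustration of flattening: copy each tweet's author object inline.

    Only `includes.users` is handled; real flattening covers more expansions.
    """
    users = {u["id"]: u for u in response.get("includes", {}).get("users", [])}
    data = response["data"]
    # `data` is a list in some responses and a single dict in streaming ones.
    tweets = data if isinstance(data, list) else [data]
    flattened = []
    for tweet in tweets:
        tweet = dict(tweet)  # shallow copy so the input is left untouched
        author = users.get(tweet.get("author_id"))
        if author is not None:
            tweet["author"] = author
        flattened.append(tweet)
    return flattened


response = {
    "data": {"id": "12345", "text": "Just setting up my twttr", "author_id": "12"},
    "includes": {"users": [{"id": "12", "username": "jack"}]},
}
result = inline_authors(response)
```

Note that this toy always returns a list, which is the behaviour the thread goes on to propose rather than what the comment says flatten did at the time.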

But maybe we can still put the genie back in the bottle? I propose that we:

- remove --flatten from all twarc2 sub-commands, but still offer twarc2 flatten, mostly for pipeline use.
- update twarc.expansions.flatten so that it always returns a list of dictionary objects where each object is a tweet, e.g. {"id": "12345", "text": "Just setting up my twttr", "author": {"username": "jack" ...
- add twarc.expansions.ensure_flattened that checks if data is flattened, and flattens if needed.

Plugins and other code that want to process flattened data can use twarc.expansions.ensure_flattened directly.
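One way the proposed check could work is sketched below. The detection rule is an assumption made for this example (a dict with a "data" key is treated as a raw API response, anything else as already flattened), and the names `ensure_flattened_sketch` and `fake_flatten` are hypothetical; the thread does not specify how the real function decides.

```python
def ensure_flattened_sketch(obj, flatten_fn):
    """Return a list of flattened tweets, flattening only when needed.

    Assumption (not taken from twarc): a raw API response is a dict with a
    "data" key; any other dict or list is treated as already flattened.
    """
    if isinstance(obj, dict) and "data" in obj:
        return flatten_fn(obj)
    if isinstance(obj, dict):
        return [obj]  # a single, already flattened tweet
    if isinstance(obj, list):
        return list(obj)  # a list of already flattened tweets
    raise ValueError("expected a dict or a list of tweet dicts")


def fake_flatten(response):
    """Stand-in for a real flatten function; it only normalises `data`."""
    data = response["data"]
    return data if isinstance(data, list) else [data]


raw = ensure_flattened_sketch({"data": {"id": "1"}}, fake_flatten)
already = ensure_flattened_sketch({"id": "2", "author": {"username": "jack"}}, fake_flatten)
```

Either way the caller gets a list of tweets, so a plugin can accept both raw and flattened input files.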

Please vote with your thumbs: 👍 👎 or leave comments to propose changes to this plan.

1 reaction

edsu commented, May 31, 2021

Thanks for catching that BIG mistake @igorbrigadir … I would have thought that the unit tests would have caught it but they were only making sure that the id for a referenced tweet was defined, instead of a tweet property that would have actually been added by flattening. Whoops.

In cfa3d55f6bb1a87cece21fb5231806b339f0c1d4 I updated the test so that it checks correctly for flattened referenced tweets and confirmed that it failed. Then I added back these lines and it passed.
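The difference between the weak and the corrected assertion might look roughly like this. The sketch uses a toy stand-in, `inline_referenced`, rather than twarc's code or its actual test, and the sample tweets are invented.

```python
def inline_referenced(tweet, included_tweets):
    """Toy stand-in for flattening: merge included tweets into references."""
    by_id = {t["id"]: t for t in included_tweets}
    merged = dict(tweet)
    merged["referenced_tweets"] = [
        {**ref, **by_id.get(ref["id"], {})}
        for ref in tweet.get("referenced_tweets", [])
    ]
    return merged


def test_referenced_tweet_is_flattened():
    tweet = {"id": "2", "referenced_tweets": [{"type": "retweeted", "id": "1"}]}
    flat = inline_referenced(tweet, [{"id": "1", "text": "original"}])
    ref = flat["referenced_tweets"][0]
    assert "id" in ref  # weak: true even if nothing was inlined
    assert ref["text"] == "original"  # strong: only true after flattening


test_referenced_tweet_is_flattened()
```

The first assertion passes on unflattened data too, since a reference always carries an id, which is why a test built only on it could not catch the regression.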
