flatten and line oriented tweets
The `twarc2 flatten` command will make any expansions/fields inline to a tweet and also make the output file line-oriented JSON, where each line is a tweet.
It would be useful for plugin development if `twarc.expansions` contained a function (e.g. `jsonl`) that returned a list where every element was a tweet dictionary, whether or not the object it was passed had a `data` key that was a list or an object.
I find myself repeating the same conditional logic in places where I want to process a file of tweets.
Does that make sense? Or maybe `flatten()` should just do it?
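The repeated conditional logic might look roughly like the sketch below. The names `tweets_from_response` and `iter_tweets` are hypothetical and not part of twarc; the only assumption is that each line of the file is a JSON object whose `data` key holds either a single tweet or a list of tweets.

```python
import json


def tweets_from_response(response):
    """Return the tweets in an API response as a list.

    Hypothetical helper, not part of twarc. Assumes `response` is a dict
    whose "data" key is either a list of tweets or a single tweet dict
    (as the streaming API returns).
    """
    data = response.get("data", [])
    if isinstance(data, dict):
        return [data]
    return list(data)


def iter_tweets(lines):
    """Yield every tweet found in an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield from tweets_from_response(json.loads(line))
```

Because an open file is an iterable of lines, something like `for tweet in iter_tweets(open("tweets.jsonl")): ...` would then work the same way for streamed and searched data.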
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
Yes @SamHames I think I’m finally coming around to your idea that twarc2 should always write whatever the API gives it. Maybe I needed the experience of using `--flatten` in the wild to recognize what you suspected early on.
Part of the difficulty here is the semantics of flattened. `twarc.expansions.flatten` moves the includes in a response inline to the objects that reference them. It does not ensure that `response['data']` is a list, as the streaming API returns `response['data']` as a dictionary object. This inconsistency was introduced by the Twitter API and not by us, and it means that any code that processes the data will need to peek at the type of `response['data']` in order to know what to do. It’s not a big deal but it’s bothersome.
But maybe we can still put the genie back in the bottle? I propose that we:
- [remove] `--flatten` from all twarc2 sub-commands, but still offer `twarc2 flatten`, mostly for pipeline use.
- [change] `twarc.expansions.flatten` so that it always returns a list of dictionary objects where each object is a tweet e.g. `{"id": "12345", "text": "Just setting up my twttr", "author": {"username": "jack" ...`
- [add] `twarc.expansions.ensure_flattened` that checks if data is flattened, and flattens if needed.
- [...] `twarc.expansions.ensure_flattened` directly.
Please vote with your thumbs: 👍 👎 or leave comments to propose changes to this plan.
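As a rough illustration of the proposal above, here is a minimal sketch of a flatten that always returns a list, plus a check that flattens only when needed. It is not twarc's implementation: `flatten_sketch` and `ensure_flattened_sketch` are hypothetical names, only `includes.users` is expanded (via each tweet's `author_id`), and the test for "already flattened" is an assumption.

```python
def flatten_sketch(response):
    """Return a list of tweets with each author inlined.

    Simplified stand-in for the proposed behaviour, not twarc's code:
    only includes.users is expanded, via each tweet's author_id.
    """
    data = response.get("data", [])
    # The streaming API returns a single dict; normalise to a list.
    tweets = [data] if isinstance(data, dict) else list(data)
    users = {u["id"]: u for u in response.get("includes", {}).get("users", [])}
    flattened = []
    for tweet in tweets:
        tweet = dict(tweet)  # shallow copy so the input is left untouched
        author = users.get(tweet.get("author_id"))
        if author is not None:
            tweet["author"] = author
        flattened.append(tweet)
    return flattened


def ensure_flattened_sketch(obj):
    """Flatten `obj` only if it still looks like a raw API response.

    Assumption: a raw response is a dict with a "data" key; anything
    else is treated as an already flattened tweet or a list of them.
    """
    if isinstance(obj, dict) and "data" in obj:
        return flatten_sketch(obj)
    if isinstance(obj, dict):
        return [obj]
    return list(obj)
```

With this shape, calling code can pass either raw or flattened data and always iterate over a list of tweet dictionaries.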
Thanks for catching that BIG mistake @igorbrigadir … I would have thought that the unit tests would have caught it but they were only making sure that the `id` for a referenced tweet was defined, instead of a tweet property that would have actually been added by flattening. Whoops.
In cfa3d55f6bb1a87cece21fb5231806b339f0c1d4 I updated the test to test correctly for flattened referenced tweets to see if it failed (it did) and then I added back these lines and it passed.
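A small sketch of why checking `id` alone could not catch the regression: a referenced tweet stub already carries an `id` before any flattening, so only a property that comes from the included tweet (here `text`) shows that flattening happened. The function `inline_referenced_tweets` is a toy written for this example, not twarc's code and not the actual test from that commit.

```python
def inline_referenced_tweets(response):
    """Toy flattener: merge each referenced tweet stub with its full
    object from includes.tweets. Not twarc's implementation."""
    included = {t["id"]: t for t in response.get("includes", {}).get("tweets", [])}
    tweets = []
    for tweet in response["data"]:
        tweet = dict(tweet)  # shallow copy so the input is left untouched
        tweet["referenced_tweets"] = [
            {**ref, **included.get(ref["id"], {})}
            for ref in tweet.get("referenced_tweets", [])
        ]
        tweets.append(tweet)
    return tweets


# Made-up response in the shape of a Twitter v2 payload.
response = {
    "data": [
        {
            "id": "2",
            "text": "RT @someone: original tweet",
            "referenced_tweets": [{"type": "retweeted", "id": "1"}],
        }
    ],
    "includes": {"tweets": [{"id": "1", "text": "original tweet"}]},
}

raw_ref = response["data"][0]["referenced_tweets"][0]
flat_ref = inline_referenced_tweets(response)[0]["referenced_tweets"][0]

# Weak check: passes with or without flattening.
assert "id" in raw_ref and "id" in flat_ref
# Stronger check: only passes once the included tweet has been inlined.
assert "text" not in raw_ref
assert flat_ref["text"] == "original tweet"
```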