Proposal: twarc2 extract [thing] command for driving further collection
As noted in #561, there’s a confusing aspect to using tweet json as input to the timelines and conversations commands.
Now that we have timelines, conversations and searches taking input files, I propose we:
- Simplify timelines and conversations so they only take an input file of ids (not tweet json)
- Provide a twarc2 extract command, that is focused on extracting interesting things from tweet json and creating a file that can be fed into (back into) those data driven commands. Each option would focus on building a deduplicated list of entries to drive another twarc2 command.
The signature for this command could look like:
twarc2 extract [hashtags|mentions|authors|urls|conversation_ids] input.json output.txt --options-of-some-kind
My hope is that by explicitly representing the intermediates as files, rather than just magic processing, it makes workflows more repeatable and consistent, and it also allows for incorporating human judgment more easily into the process by editing a file.
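To make the "deduplicated list of entries" idea concrete, here is a rough Python sketch of what an extract step might do. It is not the twarc implementation. The names iter_tweets, EXTRACTORS and extract are hypothetical, and it assumes each input line is either a single tweet object or a response page with the tweets under a "data" key, with hashtags under entities.hashtags[].tag as in the Twitter v2 tweet object. Lowercasing hashtags before deduplicating is also an assumption.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield tweet objects from line-oriented JSON.

    Assumption: each line is either one tweet object or a response
    page whose "data" key holds a tweet or a list of tweets.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        data = record.get("data", record)
        if isinstance(data, list):
            yield from data
        else:
            yield data


# Hypothetical extractors, one per "thing" named in the proposed signature.
EXTRACTORS: Dict[str, Callable[[dict], List[str]]] = {
    # Assumption: treat hashtags as case-insensitive by lowercasing them.
    "hashtags": lambda t: [
        h["tag"].lower() for h in t.get("entities", {}).get("hashtags", [])
    ],
    "authors": lambda t: [t["author_id"]] if "author_id" in t else [],
    "conversation_ids": lambda t: (
        [t["conversation_id"]] if "conversation_id" in t else []
    ),
}


def extract(thing: str, lines: Iterable[str]) -> List[str]:
    """Return the deduplicated values for `thing`, in first-seen order."""
    get_values = EXTRACTORS[thing]
    # dict.fromkeys drops duplicates while preserving insertion order.
    seen = dict.fromkeys(
        value for tweet in iter_tweets(lines) for value in get_values(tweet)
    )
    return list(seen)
```

Writing the result one entry per line, for example "\n".join(extract("hashtags", open("auspol.json"))), would give the kind of plain text file that a person can edit before feeding it into the next command.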
Example 1
This would allow you to do a hashtag snowball around a keyword of interest by something like the following:
twarc2 search "#auspol" auspol.json
twarc2 extract hashtags auspol.json auspol_hashtags.txt
# In between these steps, it would also be possible to edit the hashtags file first, for example, to identify a specific thematic
# subset of hashtags, or to remove generic hashtags that aren't relevant to the topic of interest.
twarc2 searches auspol_hashtags.txt auspol_snowball.json
Example 2
Extracting the conversations from the same collection (this replaces the conversations command reading from JSON)
twarc2 extract conversation_ids auspol.json auspol_threads.txt
twarc2 conversations auspol_threads.txt auspol_threads.json
Example 3
Extracting the timelines for a set of users tweeting about something (this replaces the timelines command reading from JSON)
twarc2 extract authors auspol.json auspol_authors.txt
twarc2 timelines auspol_authors.txt auspol_authors.json
Issue Analytics
- Created 2 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
What you’ve suggested is fine in itself. My only question is whether you (plural) want that functionality to be part of (core) twarc or not - where is it that you’re drawing the lines around the purpose of (core) twarc? That’s not to say don’t do it or that you’re at the danger point yet, just make sure you know how it fits more broadly and keep aware of the dangers of scope creep and starting to just chuck features in.
Yeah, I’m happy to prototype in a plugin first.
I’m going to separately synthesise all of the things somewhere so we can have the “core” twarc discussion somewhere durable/to act as guidance for the future.