Proposal: twarc2 extract [thing] command for driving further collection


As noted in #561, there is a confusing aspect to using tweet JSON as input to the timelines and conversations commands.

Now that we have timelines, conversations and searches taking input files, I propose we:

  1. Simplify timelines and conversations so they only take an input file of IDs (not tweet JSON)
  2. Provide a twarc2 extract command that focuses on extracting interesting things from tweet JSON and writing a file that can be fed back into those data-driven commands. Each option would build a deduplicated list of entries to drive another twarc2 command.

The signature for this command could look like:

twarc2 extract [hashtags|mentions|authors|urls|conversation_ids] input.json output.txt --options-of-some-kind

My hope is that explicitly representing the intermediate results as files, rather than relying on magic processing, makes workflows more repeatable and consistent. It also makes it easier to incorporate human judgment into the process, since the file can be edited between steps.
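
To make the idea concrete, here is a minimal sketch of what the hashtags option might do internally. It is not an existing twarc API: the function names are hypothetical, and it assumes each input line is either a single tweet object or a response page whose "data" key holds a list of tweets, with hashtags stored under entities.hashtags[].tag as in the Twitter API v2 tweet format. Deduplication here is case-insensitive, which is one possible choice among several.

```python
import json
from typing import Iterable, Iterator, List


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield tweet objects from line-oriented JSON (hypothetical helper).

    Assumption: each line is either one tweet object or a response page
    whose "data" key holds a list of tweets (or a single tweet).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        data = record.get("data")
        if isinstance(data, list):
            yield from data
        elif isinstance(data, dict):
            yield data
        else:
            yield record


def extract_hashtags(lines: Iterable[str]) -> List[str]:
    """Return deduplicated, lowercased hashtags in first-seen order."""
    seen = set()
    ordered = []
    for tweet in iter_tweets(lines):
        entities = tweet.get("entities") or {}
        for entity in entities.get("hashtags") or []:
            tag = entity.get("tag")
            if not tag:
                continue
            key = tag.lower()
            if key not in seen:
                seen.add(key)
                # Prefix with "#" so each line can be used as a search query.
                ordered.append("#" + key)
    return ordered
```

Writing the returned list one entry per line would give the kind of editable intermediate file described above.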

Example 1

This would allow you to do a hashtag snowball around a keyword of interest with something like the following:

twarc2 search "#auspol" auspol.json
twarc2 extract hashtags auspol.json auspol_hashtags.txt
# In between these steps, it would also be possible to edit the hashtags file first, for example, to identify a specific thematic
# subset of hashtags, or to remove generic hashtags that aren't relevant to the topic of interest.
twarc2 searches auspol_hashtags.txt auspol_snowball.json

Example 2

Extracting the conversations from the same collection (this replaces the conversations command reading from JSON)

twarc2 extract conversation_ids auspol.json auspol_threads.txt
twarc2 conversations auspol_threads.txt auspol_threads.json
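
As a rough illustration of the ID-based options, the sketch below collects unique values of a top-level tweet field. The function name extract_ids is hypothetical, not part of twarc, and it assumes the Twitter API v2 field names conversation_id and author_id, with each input line being either one tweet or a page with a "data" list.

```python
import json
from typing import Iterable, List


def extract_ids(lines: Iterable[str], field: str = "conversation_id") -> List[str]:
    """Return unique values of a top-level tweet field, in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        data = record.get("data")
        # Assumption: a line is either one tweet or a page with a "data" list.
        if isinstance(data, list):
            tweets = data
        elif isinstance(data, dict):
            tweets = [data]
        else:
            tweets = [record]
        for tweet in tweets:
            value = tweet.get(field)
            if value is None:
                continue
            value = str(value)
            if value not in seen:
                seen.add(value)
                ordered.append(value)
    return ordered
```

Calling extract_ids(lines, "author_id") on the same input would produce the list of authors instead, so one helper could plausibly back several extract options.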

Example 3

Extracting the timelines for a set of users tweeting about something (this replaces the timelines command reading from JSON)

twarc2 extract authors auspol.json auspol_authors.txt
twarc2 timelines auspol_authors.txt auspol_authors.json

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
betsybookwyrm commented, Oct 22, 2021

What you’ve suggested is fine in itself. My only question is whether you (plural) want that functionality to be part of (core) twarc or not - where is it that you’re drawing the lines around the purpose of (core) twarc? That’s not to say don’t do it or that you’re at the danger point yet, just make sure you know how it fits more broadly and keep aware of the dangers of scope creep and starting to just chuck features in.

1 reaction
SamHames commented, Oct 22, 2021

Yeah, I’m happy to prototype in a plugin first.

I do worry that the core twarc command is getting quite heavy. Do you think part 1 of that proposal makes sense along those lines, at least once there’s an equivalent plugin?

I’m going to separately synthesise all of the things somewhere so we can have the “core” twarc discussion somewhere durable/to act as guidance for the future.
