Random sample option
A --sample N option for twarc2 search to modify retrieval and get around the API limitation of returning all tweets in reverse chronological order.
For example:
twarc2 search --archive --sample 10 "example query" output.json
would return a 10% sample of results.
Tweet IDs are snowflake IDs, which encode a millisecond timestamp. Twitter “samples” its 1% streams on the millisecond field, by taking a fixed window of millisecond values, and we can emulate that here too. By specifying dates as tweet IDs using since_id and until_id, it should be possible to get the millisecond time ranges required to effectively create a random sample of tweets for a query.
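A minimal sketch of the ID arithmetic this relies on, assuming the usual snowflake layout (timestamp in the bits above the low 22, offset from the Twitter epoch of 1288834974657 ms). The function names and the default window (the first 100 ms of each second) are hypothetical, chosen only for illustration; none of this is existing twarc code.

```python
# Assumption: standard snowflake layout, millisecond timestamp stored
# above the low 22 bits, relative to the Twitter epoch.
TWITTER_EPOCH_MS = 1288834974657
TIMESTAMP_SHIFT = 22


def snowflake_to_ms(tweet_id: int) -> int:
    """Return the Unix timestamp in milliseconds encoded in a tweet ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def ms_to_snowflake(timestamp_ms: int) -> int:
    """Return the smallest possible tweet ID for a given millisecond."""
    return (timestamp_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def in_sample_window(tweet_id: int, window_start: int = 0, window_size: int = 100) -> bool:
    """Check whether a tweet's millisecond-within-second falls in a fixed window.

    The window position and size are illustrative; a 100 ms window
    corresponds to roughly a 10% sample.
    """
    ms_in_second = snowflake_to_ms(tweet_id) % 1000
    return window_start <= ms_in_second < window_start + window_size
```

Because the timestamp occupies the high bits, comparing IDs is equivalent to comparing creation times, which is what makes since_id and until_id usable as time bounds.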
This won’t require any new dependencies, and will be based purely on rewriting the original query parameters to use since_id and until_id and issuing lots of calls (while staying within the rate limit), so it’s maybe not suitable for a plugin, but I’m open to suggestions on whether to include this in twarc2 or in a plugin.
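The query rewriting could look roughly like the sketch below: one (since_id, until_id) pair per second of the requested range, each covering a fixed slice of milliseconds. It assumes that both bounds are exclusive, which is my understanding of the v2 search parameters, and that percent is a whole number. The name sample_windows and the offset_ms parameter are hypothetical and not part of twarc.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # assumption: standard snowflake epoch
TIMESTAMP_SHIFT = 22              # low 22 bits hold worker and sequence data


def ms_to_snowflake(timestamp_ms: int) -> int:
    """Return the smallest possible tweet ID for a given millisecond."""
    return (timestamp_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def sample_windows(start: datetime, end: datetime, percent: int, offset_ms: int = 0):
    """Yield (since_id, until_id) pairs, one per whole second in [start, end).

    Each pair covers percent% of that second's milliseconds, starting
    offset_ms into the second.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    width_ms = 10 * percent  # e.g. 10% of 1000 ms is a 100 ms window
    for second in range(int(start.timestamp()), int(end.timestamp())):
        window_start = second * 1000 + offset_ms
        window_end = window_start + width_ms
        # Assumed exclusive bounds: subtract 1 so IDs created in the first
        # millisecond of the window are still greater than since_id.
        yield ms_to_snowflake(window_start) - 1, ms_to_snowflake(window_end)


windows = list(
    sample_windows(
        datetime(2021, 1, 1, tzinfo=timezone.utc),
        datetime(2021, 1, 1, 0, 0, 3, tzinfo=timezone.utc),
        percent=10,
    )
)
```

Each pair would then replace the date bounds of the original query in a separate search request, which is where the extra calls come from.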
For reference:
https://github.com/client9/snowflake2time
https://twitter.com/JurgenPfeffer/status/1191172359216078849
https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255
Top GitHub Comments
Chiming in on use cases. I’m using the full archive search and estimate my queries will return 75 million tweets; Twitter’s academic quota is 10 million. I’d gladly pass a 10% sample parameter to get 7.5 million tweets.
Yeah, the trade-off will be the time it takes to download (the number of calls): you can make 1 call per second in the full archive search, and 1 call can retrieve a maximum of 500 tweets. To make a “sample” using this since_id / until_id strategy, in the worst case it will take roughly 10 times the number of calls to cover the same time range as it would normally take, because instead of making 1 call to get 500 tweets, for example, it would take 10 calls, each one covering one of the small time ranges of ids. I think https://twitter.com/JurgenPfeffer/status/1191172359216078849 explains it better. (It’s about the older Labs endpoints, but v2 does it exactly the same way.)
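The call counts in that example can be worked through with a small estimate. This is only a back-of-the-envelope sketch under the figures quoted above (500 tweets per call, 1 call per second); the function names are hypothetical, and it assumes every sampling window needs at least one request of its own.

```python
import math

MAX_TWEETS_PER_CALL = 500  # figure quoted in the comment above
CALLS_PER_SECOND = 1       # full archive search rate quoted above


def full_calls(total_tweets: int) -> int:
    """Calls needed to page through every matching tweet."""
    return math.ceil(total_tweets / MAX_TWEETS_PER_CALL)


def sampled_calls(n_windows: int, tweets_per_window: float) -> int:
    """Calls needed when each window is queried separately.

    Assumption: a window costs at least one call, even when it
    returns far fewer than 500 tweets.
    """
    calls_per_window = max(1, math.ceil(tweets_per_window / MAX_TWEETS_PER_CALL))
    return n_windows * calls_per_window


# The example above: 500 tweets spread evenly over 10 seconds.
normal = full_calls(500)                                        # 1 call, 500 tweets
sampled = sampled_calls(n_windows=10, tweets_per_window=5)      # 10 calls, about 50 tweets
```

At 1 call per second, the number of calls is also roughly the number of seconds the download takes, so sparse queries over long time ranges are where the sampling overhead is highest.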