Random sample option

A --sample N option for twarc2 search to modify retrieval and get around the API limitation of returning all tweets in reverse chronological order.

For example, the following would return a 10% sample of results:

twarc2 search --archive --sample 10 "example query" output.json

Tweet IDs are snowflake IDs, which encode a millisecond timestamp. Twitter "samples" its 1% streams by taking a fixed window of millisecond values, and we can emulate that here too. By specifying dates as tweet IDs with since_id and until_id, it should be possible to request the millisecond time ranges needed to create an effectively random sample of tweets for a query.
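To make the idea concrete, here is a minimal Python sketch of the ID arithmetic. It is not twarc2 code: the function names, the `offset_ms` parameter, and the choice of one window per second are assumptions made for illustration. It relies on the commonly documented snowflake layout (a millisecond timestamp offset from 1288834974657, stored above the low 22 bits) and assumes that since_id and until_id are both exclusive bounds.

```python
TWITTER_EPOCH_MS = 1288834974657  # offset subtracted from the timestamp in snowflake IDs
TIMESTAMP_SHIFT = 22              # low 22 bits hold machine and sequence numbers


def snowflake_to_ms(tweet_id):
    """Return the Unix timestamp in milliseconds encoded in a tweet ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def ms_to_snowflake(timestamp_ms):
    """Return the smallest possible tweet ID for a given millisecond."""
    return (timestamp_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def sample_windows(start_ms, end_ms, percent, offset_ms=0):
    """Yield (since_id, until_id) pairs, one per second of [start_ms, end_ms).

    Each pair covers the same slice of every second, `percent` percent
    of 1000 ms wide, starting `offset_ms` into the second (hypothetical
    parameter). Both bounds are assumed to be exclusive.
    """
    width_ms = round(percent * 10)  # e.g. 10% of a second is 100 ms
    if width_ms < 1 or offset_ms < 0 or offset_ms + width_ms > 1000:
        raise ValueError("window must fit inside a single second")
    first_second = start_ms - start_ms % 1000
    for second in range(first_second, end_ms, 1000):
        lo = max(second + offset_ms, start_ms)
        hi = min(second + offset_ms + width_ms, end_ms)
        if lo < hi:
            # Subtract 1 so the exclusive lower bound still admits IDs at `lo`.
            yield ms_to_snowflake(lo) - 1, ms_to_snowflake(hi)


if __name__ == "__main__":
    start = 1600000000000  # arbitrary example timestamp in ms
    for since_id, until_id in sample_windows(start, start + 3000, percent=10):
        print(since_id, until_id)
```

Each yielded pair would become the since_id and until_id of one search request, which is why the number of calls grows with the number of windows.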

This won’t require any new dependencies: it will be based purely on rewriting the original query parameters to use since_id and until_id and issuing lots of calls (while staying within the rate limit). So it may not be suitable for a plugin, but I’m open to suggestions on whether to include this in twarc2 or in a plugin.

For reference:

https://github.com/client9/snowflake2time

https://twitter.com/JurgenPfeffer/status/1191172359216078849

https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255

Issue Analytics

Created 2 years ago
Reactions: 1
Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions

ZacharyST commented, Oct 27, 2021

Chiming in on use cases. I’m using the full archive search and estimate my queries will return 75 million tweets; Twitter’s academic quota is 10 million. I’d gladly pass a 10% sample parameter to get 7.5 million tweets.

0 reactions

igorbrigadir commented, Oct 27, 2021

Yeah, the trade-off will be the time it takes to download (the number of calls): you can make 1 call per second in the full archive search, and 1 call can retrieve a max of 500 tweets. To make a “sample” using this since_id/until_id strategy, in the worst case it will take roughly 10 times the number of calls to retrieve the same time range as it would normally take: instead of making 1 call to get 500 tweets, for example, it would take 10 calls, each one covering one of the small time ranges of IDs. I think https://twitter.com/JurgenPfeffer/status/1191172359216078849 explains it better (it’s about the older Labs endpoints, but v2 does it exactly the same way).
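As a rough back-of-the-envelope check, the figures in this comment can be plugged into a short Python sketch. It uses the 75 million tweet estimate from the earlier comment, and it reads "roughly 10 times the number of calls" as a flat multiplier on the calls a full retrieval of the same time range would need. That reading, and the names below, are assumptions for illustration rather than measured behaviour.

```python
import math

MAX_TWEETS_PER_CALL = 500  # maximum tweets per call, from the comment
SECONDS_PER_CALL = 1       # 1 call per second in full archive search
WORST_CASE_FACTOR = 10     # assumed reading of "roughly 10 times the number of calls"


def estimate_calls(total_tweets):
    """Return (baseline_calls, worst_case_sample_calls) for a time range."""
    baseline = math.ceil(total_tweets / MAX_TWEETS_PER_CALL)
    return baseline, baseline * WORST_CASE_FACTOR


def calls_to_hours(calls):
    """Convert a number of calls into hours at the assumed rate limit."""
    return calls * SECONDS_PER_CALL / 3600


if __name__ == "__main__":
    baseline, worst = estimate_calls(75_000_000)
    print(f"full retrieval: {baseline} calls, {calls_to_hours(baseline):.1f} h")
    print(f"sampled, worst case: {worst} calls, {calls_to_hours(worst):.1f} h")
```

Under these assumptions, a sample that fits inside the quota could still take around ten times as long to download as the full result set would.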
