Random sample option


Proposal: add a --sample N option to twarc2 search that changes how results are retrieved, to get around the API limitation of always returning all tweets in reverse chronological order.

For example:

twarc2 search --archive --sample 10 "example query" output.json

This would return a 10% sample of the results.

Tweet IDs are snowflake IDs, which encode a millisecond timestamp. Twitter “samples” its 1% streams on the millisecond field, by taking a fixed window of millisecond values, and we can emulate that here too. By expressing dates as tweet IDs in since_id and until_id, it should be possible to request exactly the millisecond time ranges needed to create what is effectively a random sample of tweets for a query.
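A minimal sketch of the idea, not twarc2 code: the function names, the one-second period, and the choice to keep the first part of each period are all assumptions made for illustration. It converts between millisecond timestamps and snowflake IDs using the Twitter snowflake epoch, then yields (since_id, until_id) pairs covering a fixed slice of every period. It also assumes both since_id and until_id are exclusive bounds, which is why 1 is subtracted from the lower ID.

```python
from typing import Iterator, Tuple

TWITTER_EPOCH_MS = 1288834974657  # snowflake epoch, 2010-11-04T01:42:54.657Z
TIMESTAMP_SHIFT = 22  # the low 22 bits hold worker and sequence numbers


def snowflake_to_ms(tweet_id: int) -> int:
    """Return the Unix timestamp in milliseconds encoded in a snowflake ID."""
    return (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS


def ms_to_snowflake(unix_ms: int) -> int:
    """Return the smallest snowflake ID that carries this millisecond timestamp."""
    return (unix_ms - TWITTER_EPOCH_MS) << TIMESTAMP_SHIFT


def sample_windows(
    start_ms: int, end_ms: int, percent: int, period_ms: int = 1000
) -> Iterator[Tuple[int, int]]:
    """Yield (since_id, until_id) pairs covering `percent` of each period.

    Hypothetical helper: keeps the first `percent` of every `period_ms`
    between start_ms (inclusive) and end_ms (exclusive).
    """
    width_ms = period_ms * percent // 100
    first_period = start_ms - (start_ms % period_ms)
    for period_start in range(first_period, end_ms, period_ms):
        lo = max(period_start, start_ms)
        hi = min(period_start + width_ms, end_ms)
        if lo < hi:
            # Assumed: since_id and until_id are both exclusive bounds.
            yield ms_to_snowflake(lo) - 1, ms_to_snowflake(hi)


# A 10% sample of a 3-second range: one 100 ms window per second.
for since_id, until_id in sample_windows(1600000000000, 1600000003000, 10):
    print(since_id, until_id)
```

Each pair would then be sent as a separate search request with the original query, which is where the extra calls discussed in the comments come from.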

This won’t require any new dependencies. It would be based purely on rewriting the original query parameters to use since_id and until_id and issuing lots of calls (while staying within the rate limit), so it may not be suitable for a plugin. I’m open to suggestions on whether to include this in twarc2 itself or in a plugin.

For reference:

https://github.com/client9/snowflake2time

https://twitter.com/JurgenPfeffer/status/1191172359216078849

https://twittercommunity.com/t/generating-a-random-set-of-tweet-ids/150255

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
ZacharyST commented, Oct 27, 2021

Chiming in on use cases. I’m using the full archive search and estimate my queries will return 75 million tweets; Twitter’s academic quota is 10 million. I’d gladly pass a 10% sample parameter to get 7.5 million tweets.

0 reactions
igorbrigadir commented, Oct 27, 2021

Yeah, the trade-off will be the time it takes to download (the number of calls): you can make 1 call per second in the full archive search, and 1 call can retrieve a maximum of 500 tweets. To make a “sample” using this since_id / until_id strategy, in the worst case it will take roughly 10 times the number of calls to retrieve the same time range as it would normally take. Instead of making 1 call to get 500 tweets, for example, it would take 10 calls, each one covering one of the small time ranges of IDs. I think https://twitter.com/JurgenPfeffer/status/1191172359216078849 explains it better. (It’s about the older Labs endpoints, but v2 does it exactly the same way.)
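To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It takes the limits quoted in this comment (1 call per second, at most 500 tweets per call) and the worst-case factor of 10 at face value, and borrows the 75 million tweet estimate from the earlier comment purely as an illustration. The function names are hypothetical.

```python
import math


def calls_needed(tweets: int, tweets_per_call: int = 500) -> int:
    """Minimum number of calls if every call returns a full page."""
    return math.ceil(tweets / tweets_per_call)


def hours_at_rate(calls: int, calls_per_second: float = 1.0) -> float:
    """Wall-clock hours needed to issue `calls` at a fixed rate."""
    return calls / calls_per_second / 3600


full_calls = calls_needed(75_000_000)  # retrieving everything in the range
worst_case_sample_calls = full_calls * 10  # worst-case factor from the comment

print(f"full retrieval: {full_calls} calls, {hours_at_rate(full_calls):.1f} hours")
print(
    f"worst-case sample: {worst_case_sample_calls} calls, "
    f"{hours_at_rate(worst_case_sample_calls):.1f} hours"
)
```

Under these assumptions a full retrieval is 150,000 calls, or about 41.7 hours, and the worst-case sample is 1,500,000 calls. When the query is dense enough that every small time range still fills a 500-tweet page, the sample should need far fewer calls than this worst case.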
