conversations module "sleep" inconsistencies
Hi there, I feel like I’m running into an issue similar to the one described here: https://twittercommunity.com/t/inconsistent-rate-limit-academic-research-full-archive-search/162928/18?u=igorbrigadir
and here: https://github.com/DocNow/twarc/pull/578.
I too am fetching all tweets related to a conversation id using the twarc2 command:
twarc2 conversations --archive input_conversation_ids.txt output_conversation_tweets.jsonl
But I’m finding that it makes far fewer than the 300 requests per 15 minutes that my academic Twitter API account is allotted.
I’m using the latest version of twarc, 2.8.2.
Here is some of the log that I’m seeing. (Notice that at 10:14:24 it stops, and then doesn’t restart until 11:23:27 for no clear reason.)
2022-01-01 10:04:32,148 INFO fetching conversation 290902007206772736
2022-01-01 10:04:32,148 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902007206772736', 'max_results': 100}}
2022-01-01 10:04:32,174 WARNING rate limit exceeded: sleeping 591.8250887393951 secs
2022-01-01 10:14:24,003 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902007206772736', 'max_results': 100}}
2022-01-01 10:14:24,117 INFO Retrieved an empty page of results.
2022-01-01 10:14:24,117 INFO No more results for search conversation_id:290902007206772736.
2022-01-01 11:23:27,550 INFO fetching conversation 290902009131966464
2022-01-01 11:23:27,551 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902009131966464', 'max_results': 100}}
2022-01-01 11:23:28,605 INFO Retrieved an empty page of results.
2022-01-01 11:23:28,605 INFO No more results for search conversation_id:290902009131966464.
2022-01-01 11:23:28,607 INFO fetching conversation 290902010599964672
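To spot stalls like the one between 10:14:24 and 11:23:27 without reading the log by eye, a small script along these lines could flag large gaps between consecutive log lines. This is only a sketch: `parse_ts`, `find_gaps`, and the 905-second threshold are made up for illustration and are not part of twarc. It assumes every log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, as in the excerpt above, and the threshold is chosen so that the 901-second rate-limit sleeps are not reported.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpt, e.g. "2022-01-01 10:14:24,117"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters in the timestamp prefix


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def find_gaps(lines, threshold_secs=905):
    """Yield (previous_ts, current_ts, gap_seconds) for gaps above the threshold.

    Hypothetical helper: the default threshold sits just above the longest
    sleep seen in the log (901 secs), so only unexplained pauses show up.
    """
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp prefix
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_secs:
                yield prev, ts, gap
        prev = ts
```

Run over the excerpt above, this would report a single gap of about 4143 seconds (roughly 69 minutes) between 10:14:24 and 11:23:27; the 591.8-second sleep before it stays under the threshold.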
I’m also getting a lot of these "overlong sleep interval" warnings:
2022-01-01 15:53:56,688 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 15:53:56,688 WARNING rate limit exceeded: sleeping 901 secs
2022-01-01 16:08:57,693 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291011973909467136', 'max_results': 100}}
2022-01-01 16:08:57,795 INFO Retrieved an empty page of results.
2022-01-01 16:08:57,795 INFO No more results for search conversation_id:291011973909467136.
2022-01-01 16:08:57,795 INFO fetching conversation 291018004416839680
2022-01-01 16:08:57,796 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291018004416839680', 'max_results': 100}}
2022-01-01 16:08:57,820 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 16:08:57,820 WARNING rate limit exceeded: sleeping 901 secs
2022-01-01 16:23:58,834 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291018004416839680', 'max_results': 100}}
2022-01-01 16:23:58,967 INFO Retrieved an empty page of results.
2022-01-01 16:23:58,967 INFO No more results for search conversation_id:291018004416839680.
2022-01-01 16:23:58,968 INFO fetching conversation 291062323647500288
2022-01-01 16:23:58,968 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291062323647500288', 'max_results': 100}}
2022-01-01 16:23:58,992 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 16:23:58,992 WARNING rate limit exceeded: sleeping 901 secs
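For context on where a number like 591.8 or 901 could come from: the Twitter API returns an `x-rate-limit-reset` header holding the Unix time at which the current window resets, and a client can sleep until then. The sketch below is an assumption about how such a calculation might look, not twarc's actual code. The function name `seconds_until_reset`, the padding parameter, and the 901-second cap are illustrative only (the cap is picked because it matches the value in the warnings above).

```python
import time

WINDOW_SECS = 15 * 60  # length of one rate-limit window in seconds


def seconds_until_reset(reset_epoch, now=None, padding=10, cap=WINDOW_SECS + 1):
    """Return (sleep_seconds, clock_suspect) for a rate-limited response.

    reset_epoch: value of the x-rate-limit-reset header (Unix seconds).
    padding:     hypothetical safety margin added to the wait.
    cap:         hypothetical upper bound; a wait longer than one window
                 suggests the local clock disagrees with the server's.
    """
    if now is None:
        now = time.time()
    wait = reset_epoch - now + padding
    if wait > cap:
        # Waiting longer than a full window should never be necessary,
        # so clamp the wait and flag the clock as suspect.
        return cap, True
    # A reset time already in the past means no sleep is needed.
    return max(wait, 0), False
```

Under these assumptions, a local clock that runs behind the server's would make every wait look longer than it is, which is one situation in which an "overlong sleep interval" warning could appear.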
Any help would be appreciated; at this rate, I won’t be able to put this dataset together in any reasonable amount of time.
Issue Analytics
- Created 2 years ago
- Comments: 21 (11 by maintainers)
Top GitHub Comments
Here’s the log after roughly 1000 requests. So far, only seeing the non-900 second warnings every 300 requests. Not seeing 10 second warnings. Not seeing back-to-back long sleeps.
So far looking good…😃 sample_log (twarc-dev1).log
Oh wait, looking more closely at the log, it is definitely a bug: we’re just making too many calls. Responses with no results are where it’s most noticeable, probably because they return extremely fast.
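One way to avoid sending requests faster than the endpoint allows, even when empty responses come back almost instantly, is to enforce a minimum spacing between calls. Here is a minimal sketch, assuming a one-second minimum interval (Twitter's documentation for full-archive search listed a limit of one request per second in addition to 300 per 15 minutes). The `Throttle` class is hypothetical and is not how twarc implements its fix.

```python
import time


class Throttle:
    """Hypothetical helper that spaces calls at least min_interval apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested
        self._sleep = sleep
        self._last = None    # time of the previous call, if any

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each search request would keep fast, empty responses from burning through the per-second allowance, at the cost of about one second per conversation in the worst case.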