UnexpectedEmptyPageError at abrupt intervals
Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be features for a machine learning model, and the model will be tasked with predicting the associated terms, making it a multi-label classification problem.
I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.
Here’s what I am doing:
- Define a list of query strings I want to include in the dataset:
    query_keywords = ["image recognition",
                      "self-supervised learning",
                      "representation learning",
                      "image generation",
                      "object detection",
                      "transfer learning",
                      "transformers",
                      "adversarial training",
                      "generative adversarial networks",
                      "model compressions",
                      "image segmentation",
                      "few-shot learning"
                      ]
- Define a utility function:
    import arxiv
    from tqdm import tqdm

    def query_with_keywords(query):
        # Fetch up to 3000 results per keyword, most recently updated first.
        search = arxiv.Search(query=query,
                              max_results=3000,
                              sort_by=arxiv.SortCriterion.LastUpdatedDate)
        terms = []
        titles = []
        abstracts = []
        for res in tqdm(search.results()):
            # Keep only papers whose primary category is one of these three.
            if res.primary_category in ("cs.CV", "stat.ML", "cs.LG"):
                terms.append(res.categories)
                titles.append(res.title)
                abstracts.append(res.summary)
        return terms, titles, abstracts
- Loop the function above over the list defined in step 1:
    import time

    wait_time = 3  # seconds to pause between keywords

    all_titles = []
    all_summaries = []
    all_terms = []

    for query in query_keywords:
        terms, titles, abstracts = query_with_keywords(query)
        all_titles.extend(titles)
        all_summaries.extend(abstracts)
        all_terms.extend(terms)
        time.sleep(wait_time)
Now, while executing this I am abruptly running into:
/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
687 # Feed was never returned in self.num_retries tries. Raise the last
688 # exception encountered.
--> 689 raise err
690 return feed
691
UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)
It’s not that the search keyword has run out of pages: I have verified this, because in a new run the exception happens for a different keyword.
Was wondering if there’s a way to circumvent this. Thanks so much in advance.
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
@sayakpaul this client configuration seems to work for me (and, incidentally, significantly decreases the overall runtime). Can you confirm whether it solves the issue?
I think this can be solved using a Client with a greater number of retries; the API load here isn’t that extreme (360 requests with generous sleep times between requests).
This might also benefit from a larger page size than the Client default (100). I expect larger page sizes to cause more individual requests to fail, but decreasing the total number of pages fetched might be a net improvement.
Will test a modified client and update here.
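The trade-off described in these comments can be sketched roughly as follows. This is only an illustration, not the configuration the maintainer actually tested: the helper names requests_needed and build_patient_client are hypothetical, and the page_size, delay_seconds and num_retries values are assumptions picked for the example. The first helper reproduces the request count mentioned above (12 keywords, 3000 results each); the second builds an arxiv.Client with a larger page size and more retries.

```python
import math


def requests_needed(num_queries, max_results, page_size):
    """Upper bound on page requests: one request per page, per query."""
    return num_queries * math.ceil(max_results / page_size)


def build_patient_client(page_size=1000, delay_seconds=10, num_retries=10):
    """Return an arxiv.Client with larger pages and more retries.

    All three default values are illustrative assumptions, not settings
    confirmed anywhere in this thread.
    """
    import arxiv  # third-party package: pip install arxiv

    return arxiv.Client(
        page_size=page_size,
        delay_seconds=delay_seconds,
        num_retries=num_retries,
    )


# 12 keywords x 3000 results each:
print(requests_needed(12, 3000, 100))   # default page size -> 360
print(requests_needed(12, 3000, 1000))  # larger pages      -> 36
```

Assuming a release of the package that provides Client.results, the loop in query_with_keywords would then iterate over client.results(search) instead of search.results(), with client created once by build_patient_client() and reused for every keyword.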