UnexpectedEmptyPageError at abrupt intervals

Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. Basically, the abstracts will be the features for a machine learning model, which will be tasked with predicting the associated terms, making it a multi-label classification problem.

I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.

Here’s what I am doing:

Define a list of query strings I want to include in the dataset:

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",  # comma added: without it, Python joins this string with the next one
    "image segmentation",
    "few-shot learning"
]

Define a utility function:

import arxiv
from tqdm import tqdm

def query_with_keywords(query):
    search = arxiv.Search(query=query, 
                        max_results=3000,
                        sort_by=arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(search.results()):
        if res.primary_category=="cs.CV" or \
            res.primary_category=="stat.ML" or \
                res.primary_category=="cs.LG":

            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

Loop the above function over the list defined in step 1:

import time

wait_time = 3

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

    time.sleep(wait_time)

Now, while executing this I am abruptly running into:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
    687             # Feed was never returned in self.num_retries tries. Raise the last
    688             # exception encountered.
--> 689             raise err
    690         return feed
    691 

UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)

It’s not that the search keyword has run out of pages. I have verified this: in a new run, the exception occurs for a different keyword.

I was wondering whether there’s a way to circumvent this. Thanks so much in advance.

Top GitHub Comments

lukasschwab commented, Aug 29, 2021

@sayakpaul this client configuration seems to work for me (and, incidentally, significantly decreases the overall runtime). Can you confirm whether it solves the issue?

import arxiv
from tqdm import tqdm

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]

# Reuse a client with increased number of retries (3 -> 10) and increased page
# size (100->500).
client = arxiv.Client(num_retries=10, page_size=500)

def query_with_keywords(query):
    search = arxiv.Search(
        query=query,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG"]:
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)
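
Even with more retries, a single failed keyword would still abort the whole loop above. The following is a minimal sketch of one way to keep going: it retries each keyword a few times and records the ones that never succeed. The helper name `collect_per_keyword` and its parameters are hypothetical and not part of the arxiv package. It catches every `Exception` for illustration only; in real use it would make sense to narrow this to the error named in the traceback (`UnexpectedEmptyPageError`).

```python
import time

def collect_per_keyword(fetch, keywords, attempts=3, delay=0.0):
    """Call fetch(keyword) for each keyword, retrying on failure.

    Returns (results, failed): results maps each keyword to the value
    fetch returned; failed lists keywords that raised on every attempt.
    """
    results, failed = {}, []
    for keyword in keywords:
        for attempt in range(1, attempts + 1):
            try:
                results[keyword] = fetch(keyword)
                break  # success: move on to the next keyword
            except Exception:  # assumption: narrow this in real use
                if attempt == attempts:
                    failed.append(keyword)  # give up on this keyword only
                else:
                    time.sleep(delay * attempt)  # linear backoff before retrying
    return results, failed
```

With the code above, this could be called as `collect_per_keyword(query_with_keywords, query_keywords, delay=3)`, after which the `failed` list shows which keywords to rerun.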

lukasschwab commented, Aug 29, 2021

I think this can be solved using a Client with a greater number of retries; the API load here isn’t that extreme (360 requests with generous sleep times between requests).

This might also benefit from a larger page size than the Client default (100). I expect larger page sizes to cause more individual requests to fail, but decreasing the total number of pages fetched might be a net improvement.

Will test a modified client and update here.
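
The request count mentioned above can be checked with a little arithmetic. This sketch assumes every query actually yields the full `max_results` entries, so it gives an upper bound; the function name `total_requests` is made up for illustration.

```python
import math

def total_requests(num_queries, max_results, page_size):
    """Upper bound on API requests: one request per page, per query."""
    pages_per_query = math.ceil(max_results / page_size)
    return num_queries * pages_per_query

# 12 keywords, 3000 results each, default page size of 100
print(total_requests(12, 3000, 100))  # 360
# Same workload with a page size of 500
print(total_requests(12, 3000, 500))  # 72
```

Going from a page size of 100 to 500 cuts the number of requests by a factor of five, which is the trade-off being weighed against a possibly higher failure rate per request.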