UnexpectedEmptyPageError at abrupt intervals


Thank you for developing this package. I am trying to put together a dataset of arXiv paper abstracts and their terms. The abstracts will be the features for a machine learning model that is tasked with predicting the associated terms, which makes this a multi-label classification problem.

I am doing this for an experiment I want to perform in the area of medium-scale multi-label classification.

Here’s what I am doing:

  1. Define a list of query strings I want to include in the dataset:
query_keywords = ["image recognition", 
    "self-supervised learning", 
    "representation learning", 
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]
  2. Define a utility function:
import arxiv
from tqdm import tqdm

def query_with_keywords(query):
    search = arxiv.Search(query=query, 
                        max_results=3000,
                        sort_by=arxiv.SortCriterion.LastUpdatedDate)
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(search.results()):
        if res.primary_category=="cs.CV" or \
            res.primary_category=="stat.ML" or \
                res.primary_category=="cs.LG":

            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts
  3. Loop the above function over the list defined in step 1:
import time

wait_time = 3

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)

    time.sleep(wait_time)

Now, while executing this I am abruptly running into:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in __try_parse_feed(self, url, first_page, retries_left, last_err)
    687             # Feed was never returned in self.num_retries tries. Raise the last
    688             # exception encountered.
--> 689             raise err
    690         return feed
    691 

UnexpectedEmptyPageError: Page of results was unexpectedly empty (http://export.arxiv.org/api/query?search_query=representation+learning&id_list=&sortBy=lastUpdatedDate&sortOrder=descending&start=800&max_results=100)

The search keyword itself has not run out of pages: I have verified this, because in a new run the exception occurs for a different keyword.

Was wondering if there’s a way to circumvent this. Thanks so much in advance.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

lukasschwab commented, Aug 29, 2021 (1 reaction)

@sayakpaul this client configuration seems to work for me (and, incidentally, significantly decreases the overall runtime). Can you confirm whether it solves the issue?

import arxiv
from tqdm import tqdm

query_keywords = [
    "image recognition",
    "self-supervised learning",
    "representation learning",
    "image generation",
    "object detection",
    "transfer learning",
    "transformers",
    "adversarial training",
    "generative adversarial networks",
    "model compressions",
    "image segmentation",
    "few-shot learning"
]

# Reuse a client with increased number of retries (3 -> 10) and increased page
# size (100->500).
client = arxiv.Client(num_retries=10, page_size=500)

def query_with_keywords(query):
    search = arxiv.Search(
        query=query,
        max_results=3000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    terms = []
    titles = []
    abstracts = []
    for res in tqdm(client.results(search), desc=query):
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG"]:
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
    return terms, titles, abstracts

all_titles = []
all_summaries = []
all_terms = []

for query in query_keywords:
    terms, titles, abstracts = query_with_keywords(query)
    all_titles.extend(titles)
    all_summaries.extend(abstracts)
    all_terms.extend(terms)
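
If a query still fails after the client's own retries, one further option is to retry the whole call. The following is only a minimal sketch and not something proposed in the thread: `retry_call` and all of its parameters are hypothetical names, and it assumes the exception is exposed as `arxiv.UnexpectedEmptyPageError`. Note that retrying `query_with_keywords` restarts that query from its first page.

```python
import time


def retry_call(fn, *args, retry_on=(Exception,), attempts=3,
               base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on the given exception types.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error once attempts run out.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Assumed usage with the function defined above:
# terms, titles, abstracts = retry_call(
#     query_with_keywords, query,
#     retry_on=(arxiv.UnexpectedEmptyPageError,), attempts=3, base_delay=5.0)
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.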

lukasschwab commented, Aug 29, 2021 (1 reaction)

I think this can be solved using a Client with a greater number of retries; the API load here isn’t that extreme (360 requests with generous sleep times between requests).

This might also benefit from a larger page size than the Client default (100). I expect larger page sizes to cause more individual requests to fail, but decreasing the total number of pages fetched might be a net improvement.

Will test a modified client and update here.
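
The request count above can be reproduced with a little arithmetic. This is a rough sketch that assumes each of the 12 queries returns its full 3000 results and that no page is retried; `page_requests` is a hypothetical helper, not part of the arxiv package.

```python
import math


def page_requests(num_queries, max_results, page_size):
    """Upper bound on page requests, ignoring retries."""
    pages_per_query = math.ceil(max_results / page_size)
    return num_queries * pages_per_query


print(page_requests(12, 3000, 100))  # 360 with the default page size
print(page_requests(12, 3000, 500))  # 72 with a page size of 500
```

Under these assumptions, a fivefold larger page size means a fivefold drop in the number of requests that can come back empty, at the cost of each request being heavier.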


Top Results From Across the Web

Unreliable results: pages from API are unexpectedly empty #43
Each of those log lines is written from UnexpectedEmptyPageError.__init__ . The error is constructed, but it is only raised if all retries are...