Unreliable results: pages from API are unexpectedly empty

Describe the bug

When running the query below, I receive an inconsistent count of results on nearly every run. The list below shows the record count generated by the code in the “To Reproduce” section for each run. As you can see, I receive wildly different results each time. Is there a parameter setting I can adjust to receive more reliable results?

  • 1: 10,000
  • 2: 10,000
  • 3: 14,800
  • 4: 14,800
  • 5: 14,800
  • 6 (no max chunk results): 23,000
  • 7 (no max chunk results): 8,000

To Reproduce

import arxiv
import pandas as pd

test = arxiv.query(query="quantum", id_list=[], max_results=None, start=0, sort_by="relevance", sort_order="descending", prune=True, iterative=False, max_chunk_results=1000)
test_df = pd.DataFrame(test)
print(len(test_df))

Expected behavior

I expect a consistent count of results from this query when it is run back to back (say, within a few minutes of each other).

Versions

  • python version: 3.7.4

  • arxiv.py version: 0.5.3

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:1
  • Comments:9 (5 by maintainers)

Top GitHub Comments

2 reactions
lukasschwab commented, Apr 2, 2021

Diagnosis

After some extended testing tonight, I’m confident that this is an issue with the underlying arXiv API and not with this client library; I’ll close this issue here accordingly. The team that maintains the arXiv API is focused on building a JSON API to replace the existing Atom API; I’ll raise this issue with them.

Unfortunately, I’m not sure what the root cause is on their end, so I don’t have a recommendation. If you do want to fork and modify this package to add retries (described below), you might consider a smaller max_chunk_results page size than 1000 to make the retries faster.

The issue: the arXiv API sometimes returns a valid, but empty, feed with a 200 status even though there are entries available for the specified query at the specified offset. More at the bottom of this comment.

Available improvements

This client library can, and perhaps should, be modified to mitigate this issue using retries. arXiv feeds include an opensearch:totalResults property indicating the total number of entries corresponding to the query; Search._get_next should:

  1. Pull this property to use it to limit pagination: n_left = min(self.max_results, <totalResults from feed>)
  2. Retry if n_left > 0 but results is an empty list.
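
The two steps above can be sketched as follows. This is a minimal illustration, not the library's implementation: fetch_all and the fetch_page callback (which is assumed to return a page of entries together with the feed's totalResults value) are hypothetical names introduced here, and the retry count and delay are placeholder defaults.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical callback: fetch_page(start, size) -> (entries, total_results),
# where total_results mirrors the feed's opensearch:totalResults value.
Page = Tuple[List[dict], int]


def fetch_all(fetch_page: Callable[[int, int], Page],
              max_results: Optional[int] = None,
              page_size: int = 100,
              num_retries: int = 3,
              delay_seconds: float = 0.0) -> List[dict]:
    """Page through results, retrying pages that come back empty too early."""
    results: List[dict] = []
    start = 0
    n_left = max_results  # None until the feed tells us the total
    while n_left is None or n_left > 0:
        size = page_size if n_left is None else min(page_size, n_left)
        entries: List[dict] = []
        for _ in range(num_retries + 1):
            entries, total = fetch_page(start, size)
            # Step 1: limit pagination by the total reported in the feed.
            in_feed = max(total - start, 0)
            n_left = in_feed if max_results is None else min(
                max_results - len(results), in_feed)
            # Step 2: retry only if entries are expected but none arrived.
            if entries or n_left <= 0:
                break
            time.sleep(delay_seconds)
        if not entries:
            break  # nothing left, or retries exhausted
        entries = entries[:n_left]
        results.extend(entries)
        # Advance by what was actually received, so partial pages are safe.
        start += len(entries)
        n_left -= len(entries)
    return results
```

Advancing start by the number of entries actually received, rather than by the requested page size, means a partial page does not cause records to be skipped.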

I spotted an unrelated bug in _prune_result which I’ll fix shortly.

I’m actually inclined to clean up this client more deeply, which will probably lead to a 1.0.0 release (and perhaps an interface that’ll play nicer with the new JSON API when it’s released).

Testing

I tested using the query I constructed earlier in this issue:

import arxiv
test = arxiv.query(query="quantum", id_list=[], max_results=None, start = 0, sort_by="relevance", sort_order="descending", prune=True, iterative=False, max_chunk_results=1000)

I modified two functions to shed some light on why _get_next stopped iterating:

  • I modified _parse to log the requested URL for each request, the resulting HTTP code, and the number of entries.
  • I modified _get_next so that, when n_fetched == 0, it double-checks that result by re-invoking _parse with the same arguments that yielded zero entries.

In one such run, I got an empty 200 response at start=6000:

{'bozo': False, 'entries': [], 'feed': {'links': [{'href': 'http://arxiv.org/api/query?search_query%3Dquantum%26id_list%3D%26start%3D6000%26max_results%3D1000', 'rel': 'self', 'type': 'application/atom+xml'}], 'title': 'ArXiv Query: search_query=quantum&amp;id_list=&amp;start=6000&amp;max_results=1000', 'title_detail': {'type': 'text/html', 'language': None, 'base': 'http://export.arxiv.org/api/query?search_query=quantum&id_list=&start=6000&max_results=1000&sortBy=relevance&sortOrder=descending', 'value': 'ArXiv Query: search_query=quantum&amp;id_list=&amp;start=6000&amp;max_results=1000'}, 'id': 'http://arxiv.org/api/U9c7OUmEOZDvAXlaxzJl09rG9z0', 'guidislink': True, 'link': 'http://arxiv.org/api/U9c7OUmEOZDvAXlaxzJl09rG9z0', 'updated': '2021-04-02T00:00:00-04:00', 'updated_parsed': time.struct_time(tm_year=2021, tm_mon=4, tm_mday=2, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=92, tm_isdst=0), 'opensearch_totalresults': '320665', 'opensearch_startindex': '6000', 'opensearch_itemsperpage': '1000'}, 'headers': {'date': 'Fri, 02 Apr 2021 04:41:09 GMT', 'server': 'Apache', 'access-control-allow-origin': '*', 'vary': 'Accept-Encoding,User-Agent', 'content-encoding': 'gzip', 'content-length': '412', 'connection': 'close', 'content-type': 'application/atom+xml; charset=UTF-8'}, 'href': 'http://export.arxiv.org/api/query?search_query=quantum&id_list=&start=6000&max_results=1000&sortBy=relevance&sortOrder=descending', 'status': 200, 'encoding': 'UTF-8', 'version': 'atom10', 'namespaces': {'': 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}}

But re-calling _parse yielded 1000 entries. In this case, a retry would continue _get_next’s iteration.
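
A response like the one above can be flagged before giving up on pagination. The sketch below is an assumption-laden illustration rather than part of arxiv.py: summarize_page is a hypothetical helper that takes the raw Atom XML of one page and uses only the standard library, with the two namespace URIs taken from the response shown above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the feed dump above.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
}


def summarize_page(atom_xml: str) -> dict:
    """Count entries in one Atom page and flag a suspiciously empty one."""
    root = ET.fromstring(atom_xml)
    n_entries = len(root.findall("atom:entry", NS))
    total = int(root.findtext("opensearch:totalResults", default="0", namespaces=NS))
    start = int(root.findtext("opensearch:startIndex", default="0", namespaces=NS))
    return {
        "entries": n_entries,
        "total": total,
        "start": start,
        # Empty page although the feed claims more results past this offset.
        "suspicious_empty": n_entries == 0 and start < total,
    }
```

For the response shown above (no entries, startIndex 6000, totalResults 320665), suspicious_empty would be True, which is the condition under which a retry is worthwhile.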

1 reaction
lukasschwab commented, Apr 2, 2021

Did some more work on this tonight.

Anecdotally, the need for retries (and other weird behavior like partial pages) seems to arise more often with large page sizes; reducing the page size from 1000 to 100 makes this issue hard to reproduce. Hope that’s helpful!

I’ve started sketching out a v1.0.0 client that adds retries; in my cursory testing so far, a small number of retries (default: 3) seems to make this behave more robustly.

That sketch is here: https://github.com/lukasschwab/arxiv.py/tree/v1.0.0-rewrite

But beware:

  • There’s still a lot of v0.x functionality to reimplement
  • I need to clean up a lot (tests, docs, removing now-unused code)
  • This is definitely a breaking change; I may go so far as to define a result-entry class, to make results here easier to work with than the existing dicts.

Thanks for the input on this issue; I think this’ll lead to a meaningful improvement in this package 😁
