Unreliable results: pages from API are unexpectedly empty
Describe the bug
When running the query below, I receive an inconsistent count of results on nearly every run. The list below shows the record counts generated by the code in the “To Reproduce” section. As you can see, the counts differ wildly from run to run. Is there a parameter I can adjust to get more reliable results?
- 1: 10,000
- 2: 10,000
- 3: 14,800
- 4: 14,800
- 5: 14,800
- 6 (no max chunk results): 23,000
- 7 (no max chunk results): 8,000
To Reproduce
    import arxiv
    import pandas as pd

    test = arxiv.query(
        query="quantum",
        id_list=[],
        max_results=None,
        start=0,
        sort_by="relevance",
        sort_order="descending",
        prune=True,
        iterative=False,
        max_chunk_results=1000,
    )
    test_df = pd.DataFrame(test)
    print(len(test_df))
Expected behavior
I expect a consistent count of results from this query when it is run back to back (say, within a few minutes of each other).
Versions
- python version: 3.7.4
- arxiv.py version: 0.5.3
Top GitHub Comments
Diagnosis
After some extended testing tonight, I’m confident that this is an issue with the underlying arXiv API and not with this client library; I’ll close this issue here accordingly. The team that maintains the arXiv API is focused on building a JSON API to replace the existing Atom API; I’ll raise this issue with them.
Unfortunately, I’m not sure what the root cause is on their end, so I don’t have a recommendation. If you do want to fork and modify this package to add retries (described below), you might consider a max_chunk_results page size smaller than 1000 to make the retries faster.
The issue: the arXiv API sometimes returns a valid, but empty, feed with a 200 status even though there are entries available for the specified query at the specified offset. More at the bottom of this comment.
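To make the failure mode concrete, here is a minimal sketch of how an "empty but valid" page could be detected. It is not part of arxiv.py: the function name is_suspicious_empty_page and the hand-written sample feed are illustrative assumptions, and the check assumes the feed still reports a correct opensearch:totalResults when its entry list is empty.

```python
import xml.etree.ElementTree as ET

# XML namespaces for Atom feeds that carry OpenSearch metadata.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
}

# Hand-written sample for illustration only; NOT real API output.
EMPTY_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>illustrative sample feed</title>
  <opensearch:totalResults>23000</opensearch:totalResults>
  <opensearch:startIndex>6000</opensearch:startIndex>
</feed>"""


def is_suspicious_empty_page(feed_xml: str, start: int) -> bool:
    """Hypothetical helper: True if the feed has no entries even though
    totalResults says entries should exist at offset `start`."""
    root = ET.fromstring(feed_xml)
    total = int(root.findtext("opensearch:totalResults", default="0", namespaces=NS))
    n_entries = len(root.findall("atom:entry", NS))
    return n_entries == 0 and start < total
```

A page flagged this way is a candidate for a retry rather than a signal that the result set is exhausted.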
Available improvements
This client library can, and perhaps should, be modified to mitigate this issue using retries. arXiv feeds include an opensearch:totalResults property indicating the total number of entries corresponding to the query; Search._get_next should:
- compute n_left = min(self.max_results, <totalResults from feed>)
- retry when n_left > 0 but results is an empty list.
I spotted an unrelated bug in _prune_result, which I’ll fix shortly.
I’m actually inclined to clean up this client more deeply, which will probably lead to a 1.0.0 release (and perhaps an interface that’ll play nicer with the new JSON API when it’s released).
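The retry idea described above could be sketched roughly as follows. This is not the library's implementation: iter_with_retries, fetch_page, num_retries, and delay_seconds are hypothetical names, and fetch_page stands in for whatever performs the HTTP request and returns the page's entries along with the feed's totalResults. The sketch also assumes totalResults is trustworthy even on an empty page.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page fetcher: (start, page_size) -> (entries, totalResults).
FetchPage = Callable[[int, int], Tuple[List[str], int]]


def iter_with_retries(
    fetch_page: FetchPage,
    max_results: Optional[int] = None,
    page_size: int = 100,
    num_retries: int = 3,
    delay_seconds: float = 1.0,
) -> Iterator[str]:
    """Yield entries page by page, retrying pages that come back empty
    while the feed's total says more entries should exist."""
    start = 0
    limit = max_results  # refined once a feed reports totalResults
    while limit is None or start < limit:
        entries: List[str] = []
        for attempt in range(1 + num_retries):
            entries, total = fetch_page(start, page_size)
            limit = total if max_results is None else min(max_results, total)
            if entries or start >= limit:
                break  # got data, or genuinely past the end
            if attempt < num_retries:
                time.sleep(delay_seconds)  # brief pause before retrying
        n_left = limit - start
        if not entries or n_left <= 0:
            return  # finished, or retries exhausted on an empty page
        for entry in entries[:n_left]:
            yield entry
        start += len(entries)
```

With a fetcher that returns one spurious empty page, the iterator requests that offset again and continues, instead of stopping early as the unpatched loop would.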
Testing
I tested using the query I constructed earlier in this issue:
I modified two functions to shed some light on why _get_next stopped iterating:
- _parse, to log the requested URL for each request, the resulting HTTP code, and the number of entries.
- _get_next, to double-check the result when n_fetched == 0 by reinvoking _parse with the arguments that yielded zero entries.
In one such run, I got an empty 200 response at start=6000, but re-calling _parse yielded 1000 entries. In this case, a retry would continue _get_next’s iteration.
Did some more work on this tonight.
Anecdotally, retries (and other weird behavior like partial pages) seem to happen more with large page sizes; reducing the page size from 1000 to 100 makes this issue hard to reproduce. Hope that’s helpful!
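For anyone experimenting with smaller pages outside the client, the sketch below shows how a page size translates into start and max_results offsets on the arXiv query endpoint. The helper page_urls and the search string "all:quantum" are illustrative assumptions; the code only builds URLs and makes no requests.

```python
from typing import List
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def page_urls(search_query: str, total: int, page_size: int = 100) -> List[str]:
    """Hypothetical helper: one request URL per page of `page_size` results."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls
```

Dropping the page size from 1000 to 100 means ten times as many requests for the same result set, so each retry is cheaper but the whole run takes more round trips.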
I’ve started sketching out a v1.0.0 client that adds retries; in my cursory testing so far, a small number of retries (default: 3) seems to make this behave more robustly.
That sketch is here: https://github.com/lukasschwab/arxiv.py/tree/v1.0.0-rewrite
But beware:
Thanks for the input on this issue; I think this’ll lead to a meaningful improvement in this package 😁