`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive
Error:
/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in results(self, search)
552 ))
553 page_url = self._format_url(search, offset, page_size)
--> 554 feed = self._parse_feed(page_url, first_page)
555 if first_page:
556 # NOTE: this is an ugly fix for a known bug. The totalresults
/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in _parse_feed(self, url, first_page)
635 # Feed was never returned in self.num_retries tries. Raise the last
636 # exception encountered.
--> 637 raise err
638
639
HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)
Code for parsing the ID from an arxiv result object:
id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]
Code to reproduce:
ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109', '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630', '0503148', '1111.6170', '1612.04479', '0307110', '0306127', '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']
papers = arxiv.Search(id_list=ids).get()
The invalid IDs are '0112019', '0205137', etc.
The respective PDF URLs are still accessible, for example: https://arxiv.org/pdf/quant-ph/0112019.pdf
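For anyone hitting the same HTTP 400, a quick way to spot the problematic entries in a list like the one above is to look for bare seven-digit IDs with no archive prefix. This is only a sketch: the helper name `find_unqualified_old_ids` is made up for illustration and is not part of the arxiv library.

```python
import re

# Assumption: an old-form ID without its archive is exactly seven digits
# (YYMMNNN) and contains no dot, unlike new-form IDs such as 1911.10854.
BARE_OLD_ID = re.compile(r"\d{7}")


def find_unqualified_old_ids(ids):
    """Return the IDs that look like old-form numbers missing an archive prefix."""
    # Hypothetical helper, not an arxiv.py function.
    return [i for i in ids if BARE_OLD_ID.fullmatch(i)]


ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109',
       '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630',
       '0503148', '1111.6170', '1612.04479', '0307110', '0306127',
       '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

suspect = find_unqualified_old_ids(ids)
print(suspect)
```

Any ID this flags would need its archive (for example `quant-ph/`) added before being passed in `id_list`.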
The same error is referenced in another open issue (#15), but there it is described in the context of very large ID lists.
Apologies if I simply lack sufficient knowledge about identifier naming conventions, but it should download from all research fields, right?
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
@sidphbot patch is included in 1.4.0.
Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.
Explanation
The old-form arXiv ID is a combination of a subject component, a date component, and a counter component. `0112019` is the `019`th paper submitted in the `12`th month of 2001 … but, because the counts are archive-specific, the numeric component isn’t unique. There is a `0112019` in quantum physics, but there may also be a `0112019` in astrophysics and a `0112019` in math.

This old format only uniquely identifies a paper if we specify which archive’s count it refers to. In this case, we want `quant-ph`.

The fully-qualified ID for `0112019` is `quant-ph/0112019`. Accordingly, requesting the fully-qualified ID works. But the short ID reported by this client library is incorrect: it drops the archive.
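To make the structure described above concrete, here is a minimal sketch that splits a fully-qualified old-form ID into its parts. The names `OldArxivId` and `parse_old_id` are hypothetical (not part of this library), and the regular expression is my own approximation of the old scheme: an archive, an optional two-letter subject class, then `YYMMNNN` and an optional version.

```python
import re
from typing import NamedTuple, Optional

# Approximate pattern for old-form IDs, e.g. quant-ph/0112019 or
# math.GT/0309136v1. This is an assumption, not the official grammar.
OLD_ID = re.compile(
    r"(?P<archive>[a-z-]+)(?:\.(?P<subject_class>[A-Z]{2}))?/"
    r"(?P<yy>\d{2})(?P<mm>\d{2})(?P<number>\d{3})(?:v\d+)?"
)


class OldArxivId(NamedTuple):
    archive: str
    subject_class: Optional[str]
    year: int
    month: int
    number: int


def parse_old_id(identifier: str) -> OldArxivId:
    """Split a fully-qualified old-form ID into archive, date, and counter."""
    m = OLD_ID.fullmatch(identifier)
    if m is None:
        # A bare number such as "0112019" lands here: no archive, so ambiguous.
        raise ValueError(f"not a fully-qualified old-form ID: {identifier!r}")
    yy = int(m["yy"])
    # Old-form IDs run from 1991 to 2007, so 91-99 means 19YY, otherwise 20YY.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return OldArxivId(m["archive"], m["subject_class"], year,
                      int(m["mm"]), int(m["number"]))


print(parse_old_id("quant-ph/0112019"))
```

Passing the bare `0112019` raises `ValueError`, which mirrors the point above: without the archive, the number does not identify a single paper.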
Instead of just taking the last path element here, I should be taking the full contents of the path following `http://arxiv.org/abs/`:
https://github.com/lukasschwab/arxiv.py/blob/ea93efa9f369da995f657856447f4ad998f9076f/arxiv/arxiv.py#L169-L176
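Roughly, the difference between the two approaches looks like the sketch below. Both function names are hypothetical and this is not the actual patch, only an illustration under the assumption that `entry_id` is a URL of the form `http://arxiv.org/abs/<id>`.

```python
from urllib.parse import urlparse


def short_id_last_element(entry_id: str) -> str:
    """Buggy approach: keep only the last path element (loses the archive)."""
    return urlparse(entry_id).path.split("/")[-1]


def short_id_full_path(entry_id: str) -> str:
    """Sketch of the fix: keep everything in the path after /abs/."""
    path = urlparse(entry_id).path
    prefix = "/abs/"
    if not path.startswith(prefix):
        raise ValueError(f"unexpected entry_id format: {entry_id!r}")
    return path[len(prefix):]


old_form = "http://arxiv.org/abs/quant-ph/0112019v1"
new_form = "http://arxiv.org/abs/1911.10854v2"

print(short_id_last_element(old_form))  # archive is dropped
print(short_id_full_path(old_form))     # archive is kept
print(short_id_full_path(new_form))     # new-form IDs are unaffected
```

New-form IDs contain no slash, so both approaches agree on them; only old-form IDs change.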
@sidphbot if you’re working from hardcoded IDs, adding the archives should solve this issue for you.
If you’re re-querying incorrect IDs returned by this client library, I’ll have a patch out shortly.