`get_short_id` incorrect for pre-March 2007 arXiv identifiers: missing archive

Error:

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in results(self, search)
    552             ))
    553             page_url = self._format_url(search, offset, page_size)
--> 554             feed = self._parse_feed(page_url, first_page)
    555             if first_page:
    556                 # NOTE: this is an ugly fix for a known bug. The totalresults

/usr/local/lib/python3.7/dist-packages/arxiv/arxiv.py in _parse_feed(self, url, first_page)
    635         # Feed was never returned in self.num_retries tries. Raise the last
    636         # exception encountered.
--> 637         raise err
    638 
    639 

HTTPError: arxiv.HTTPError(Page request resulted in HTTP 400)

Code for parsing the ID from an arxiv result object: id = urlparse(result.entry_id).path.split('/')[-1].split('v')[0]

Code to reproduce:

import arxiv

ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109', '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630', '0503148', '1111.6170', '1612.04479', '0307110', '0306127', '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

papers = arxiv.Search(id_list=ids).get()

The IDs rejected as invalid are '0112019', '0205137', etc.

The respective PDF URLs are still accessible, for example: https://arxiv.org/pdf/quant-ph/0112019.pdf

The same error is referenced in another open issue (#15), but there it arises from very large ID arrays.

Apologies if I simply lack sufficient knowledge about identifier naming conventions, but it should download papers from all research fields, right?

Issue Analytics

State:
Created 2 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction

lukasschwab commented, Jul 13, 2021

@sidphbot patch is included in 1.4.0.

1 reaction

lukasschwab commented, Jul 13, 2021

Aha! It looks like old-form IDs can be requested; they just need to be fully-qualified with the archive and (where applicable) subject class.

Explanation

The old-form arXiv ID is a combination of a subject component, a date component, and a counter component.

[Diagram: breakdown of the old-form arXiv ID into its components]

0112019 is paper number 019 submitted in the 12th month of 2001… but, because the counts are archive-specific, the numeric component isn’t unique. There is a 0112019 in quantum physics, but there may also be a 0112019 in astrophysics and a 0112019 in math.

This old format only uniquely identifies a paper if we specify which archive’s count it refers to. In this case, we want quant-ph.
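To make the structure concrete, here is a minimal sketch of splitting a fully-qualified old-form ID into its components. The `OldFormID` type, the `OLD_FORM` pattern, and `parse_old_form` are hypothetical names written for this illustration; they are not part of arxiv.py, and the pattern assumes archive names are lowercase words with at most one hyphen and that subject classes are two uppercase letters.

```python
import re
from typing import NamedTuple, Optional


class OldFormID(NamedTuple):
    archive: str                   # e.g. "quant-ph"
    subject_class: Optional[str]   # e.g. "DG" in "math.DG"; usually absent
    year: int
    month: int
    number: int                    # per-archive counter within that month
    version: Optional[int]


# Hypothetical pattern for illustration only; not taken from arxiv.py.
OLD_FORM = re.compile(
    r"^(?P<archive>[a-z]+(?:-[a-z]+)?)"           # archive, e.g. hep-th
    r"(?:\.(?P<subject>[A-Z]{2}))?"               # optional subject class
    r"/(?P<yy>\d{2})(?P<mm>\d{2})(?P<num>\d{3})"  # YYMM + 3-digit counter
    r"(?:v(?P<version>\d+))?$"                    # optional version suffix
)


def parse_old_form(identifier: str) -> OldFormID:
    match = OLD_FORM.match(identifier)
    if match is None:
        raise ValueError(
            f"not a fully-qualified old-form arXiv ID: {identifier!r}"
        )
    yy = int(match["yy"])
    # Old-form IDs run from 9107 to 0703, so 91-99 is read as 19xx and
    # anything lower as 20xx. The sketch does not reject out-of-range dates.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    version = match["version"]
    return OldFormID(
        archive=match["archive"],
        subject_class=match["subject"],
        year=year,
        month=int(match["mm"]),
        number=int(match["num"]),
        version=int(version) if version is not None else None,
    )


print(parse_old_form("quant-ph/0112019v1"))
```

A bare `0112019` is rejected by this sketch with a `ValueError`, which mirrors the point above: without the archive, the numeric part alone does not name a single paper.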

The fully-qualified ID for 0112019 is quant-ph/0112019. Accordingly, the following code works:

>>> import arxiv
>>> next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
arxiv.Result(entry_id='http://arxiv.org/abs/quant-ph/0112019v1', updated=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), published=datetime.datetime(2001, 12, 4, 2, 54, 2, tzinfo=datetime.timezone.utc), title='Classical entanglement', authors=[arxiv.Result.Author('Douglas G. Danforth')], summary='Classical systems can be entangled. Entanglement is defined by coincidence\ncorrelations. Quantum entanglement experiments can be mimicked by a mechanical\nsystem with a single conserved variable and 77.8% conditional efficiency.\nExperiments are replicated for four particle entanglement swapping and GHZ\nentanglement.', comment=None, journal_ref=None, doi=None, primary_category='quant-ph', categories=['quant-ph'], links=[arxiv.Result.Link('http://arxiv.org/abs/quant-ph/0112019v1', title=None, rel='alternate', content_type=None), arxiv.Result.Link('http://arxiv.org/pdf/quant-ph/0112019v1', title='pdf', rel='related', content_type=None)])

But the short ID reported by this client library is incorrect:

>>> r = next(arxiv.Search(id_list=['quant-ph/0112019v1']).results())
>>> r.entry_id
'http://arxiv.org/abs/quant-ph/0112019v1'

Instead of just taking the last path element here, I should be taking the full contents of the path following http://arxiv.org/abs/:

https://github.com/lukasschwab/arxiv.py/blob/ea93efa9f369da995f657856447f4ad998f9076f/arxiv/arxiv.py#L169-L176
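The difference between the two approaches can be sketched in a few lines. Both function names below are hypothetical and written only for this comparison; this is not the library's actual code, just one way to express "last path element" versus "everything after http://arxiv.org/abs/". The new-style URL uses an ID from the reproduction list, with a `v1` suffix assumed for illustration.

```python
def short_id_last_segment(entry_id: str) -> str:
    """Keep only the final path element; drops the archive of old-form IDs."""
    return entry_id.split("/")[-1]


def short_id_full_path(entry_id: str) -> str:
    """Keep everything after 'arxiv.org/abs/' so the archive survives."""
    return entry_id.split("arxiv.org/abs/")[-1]


old_style = "http://arxiv.org/abs/quant-ph/0112019v1"
new_style = "http://arxiv.org/abs/1911.10854v1"  # version suffix assumed

print(short_id_last_segment(old_style))  # 0112019v1 (archive lost)
print(short_id_full_path(old_style))     # quant-ph/0112019v1
print(short_id_last_segment(new_style))  # 1911.10854v1
print(short_id_full_path(new_style))     # 1911.10854v1
```

New-style IDs contain no slash, so both approaches agree on them; only old-form IDs expose the difference.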

@sidphbot if you’re working from hardcoded IDs, adding the archives should solve this issue for you.

If you’re re-querying incorrect IDs returned by this client library, I’ll have a patch out shortly.
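For hardcoded lists, a small check can flag which IDs still need an archive before they are sent to the API. This is a sketch under the assumption that a bare old-form ID is exactly seven digits with an optional version suffix; `BARE_OLD_FORM`, `find_unqualified_ids`, and `qualify` are hypothetical helpers, not part of arxiv.py. The archive for each flagged ID has to be looked up separately, since it cannot be derived from the digits.

```python
import re

# Assumption: a bare old-form ID is 7 digits (YYMMNNN), optionally followed
# by a version suffix such as "v2". New-style IDs contain a dot and do not match.
BARE_OLD_FORM = re.compile(r"^\d{7}(?:v\d+)?$")


def find_unqualified_ids(id_list):
    """Return the IDs that look like old-form IDs missing their archive."""
    return [i for i in id_list if BARE_OLD_FORM.match(i)]


def qualify(bare_id, archive):
    """Prefix a bare old-form ID with its archive, e.g. 'quant-ph'."""
    return f"{archive}/{bare_id}"


ids = ['1911.10854', '1905.00256', '0112019', '1202.2184', '1708.03109',
       '0205137', '1610.08147', '2003.05245', '0406182', '0708.3630',
       '0503148', '1111.6170', '1612.04479', '0307110', '0306127',
       '1307.2727', '0402059', '1012.4706', '1906.01999', '0101032']

print(find_unqualified_ids(ids))
# ['0112019', '0205137', '0406182', '0503148', '0307110', '0306127', '0402059', '0101032']

print(qualify('0112019', 'quant-ph'))
# quant-ph/0112019
```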
