Enable user to use .export for PDF download
Motivation
The arxiv library uses the export.arxiv.org subdomain to query papers, but downloads the PDFs directly from arxiv.org. As a result, a user who downloads too many papers can get blocked by arXiv.
Solution
A solution would be to modify the paper's PDF URL so that it points to the corresponding export subdomain. In my own code I simply use:
idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]
where paper is a Result instance. This solution is incomplete, though, since the export subdomain is not guaranteed to exist; that would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced for users who wish to download directly from arxiv.org, even though that is not advised according to the “Play Nice” section of https://arxiv.org/help/bulk_data.
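To make the proposal concrete, here is a minimal sketch of the URL rewrite with the opt-out flag. The function name to_export_url is hypothetical and not part of the arxiv library; user_export mirrors the flag name proposed above. It uses only the standard library and does not check whether the export host actually serves the file, which is the open question noted above.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(pdf_url: str, user_export: bool = True) -> str:
    """Hypothetical helper: point an arxiv.org PDF URL at export.arxiv.org.

    This is an illustration only, not the library's _get_pdf_url method.
    """
    if not user_export:
        # The caller explicitly asked to download from arxiv.org directly.
        return pdf_url

    parts = urlsplit(pdf_url)
    host = parts.hostname or ""

    # Only rewrite the main site. URLs that already use the export
    # subdomain, or that point at an unrelated host, are returned unchanged.
    if host not in ("arxiv.org", "www.arxiv.org"):
        return pdf_url

    return urlunsplit(parts._replace(netloc="export.arxiv.org"))
```

Unlike the string-index version, parsing the URL avoids inserting 'export.' a second time when the URL already points at the export subdomain.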
Issue Analytics
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Interesting, sorry about the bad assumption; I didn’t realize this used the export site. That’s even more perplexing, then. And no, I didn’t call download_pdf 300k times. I got a 403 after attempting to run:
results = arxiv.Search(query="cat:cs.LG").results()
I can open a separate PR.
@brandonrobertz No worries! Happy to advise.