PDF corrupt when downloaded with getpapers, O.K. when downloaded directly
I am totally new to getpapers, so this may just be a misunderstanding on my part of some of the inner workings of the ContentMine toolchain (even though it seems to be pretty straightforward…).
Here is the problem: when I use getpapers to download a paper, say with the command:
getpapers --query 'PMCID:"PMC5293196" JOURNAL:"PLOS ONE"' --outdir network-analysis -p
I get:
info: Searching using eupmc API
info: Found 1 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (1/1) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!
which looks pretty good to me (the warnings seem to be harmless in my case). Everything is there and the PDF seems to be in
PMC5293196/fulltext.pdf
However, when I try to open it in acroread, the reader tells me that the file is corrupt, that some font is missing, etc., and the paper is practically unreadable!
If I try to download it directly from the URL
http://europepmc.org/articles/PMC5293196?pdf=render
which is found in the eupmc_results.json as the download URL for this ID, all is OK and the paper is displayed with absolutely no errors…
There is indeed a difference in the MD5 sums of the two versions:
md5sum PMC5293196/fulltext.pdf
b815dff20f24ddbb035b4bfc3743d7f5 PMC5293196/fulltext.pdf
md5sum PMC5293196.pdf
5771a39765664bedb8ad8815b7980921 PMC5293196.pdf
The latter (PMC5293196.pdf) is the ‘good’ one, downloaded directly from http://europepmc.org/articles/PMC5293196?pdf=render with curl, the former (PMC5293196/fulltext.pdf) is the ‘bad’ one downloaded by getpapers.
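For anyone who wants to repeat the comparison from Node rather than with md5sum, here is a minimal sketch. The helper name `describeFile` is made up for illustration; it only uses the standard `crypto` and `fs` modules and reads each file as raw bytes, so nothing is decoded along the way.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical helper: report size, MD5 and PDF header check for one file.
// readFileSync without an encoding argument returns a Buffer (raw bytes).
function describeFile(path) {
  const bytes = fs.readFileSync(path);
  return {
    size: bytes.length,
    md5: crypto.createHash('md5').update(bytes).digest('hex'),
    // A well-formed PDF begins with the ASCII marker "%PDF-".
    hasPdfHeader: bytes.slice(0, 5).toString('latin1') === '%PDF-',
  };
}

// Paths taken from the report above; skipped if the files are not present.
for (const path of ['PMC5293196.pdf', 'PMC5293196/fulltext.pdf']) {
  if (fs.existsSync(path)) {
    console.log(path, describeFile(path));
  }
}
```

Comparing the sizes alongside the checksums is useful here, because a large size difference hints at a re-encoding problem rather than a truncated download.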
It does not seem to be a locale issue, because the same happens to a user with
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE=C
and to a user with a non-US locale.
I have nodejs 4.4.6 and getpapers is installed with
npm install --global getpapers
npm WARN deprecated node-uuid@1.4.7: use uuid module instead
npm WARN deprecated tough-cookie@2.2.2: ReDoS vulnerability parsing Set-Cookie https://nodesecurity.io/advisories/130
/usr/bin/getpapers -> /usr/lib/node_modules/getpapers/bin/getpapers.js
getpapers@0.4.12 /usr/lib/node_modules/getpapers
├── progress@1.1.8
├── version_compare@0.0.3
├── commander@2.7.1 (graceful-readlink@1.0.1)
├── chalk@1.0.0 (escape-string-regexp@1.0.5, ansi-styles@2.2.1, supports-color@1.3.1, strip-ansi@2.0.1, has-ansi@1.0.3)
├── mkdirp@0.5.1 (minimist@0.0.8)
├── sanitize-filename@1.6.1 (truncate-utf8-bytes@1.0.2)
├── got@2.9.2 (lowercase-keys@1.0.0, timed-out@2.0.0, prepend-http@1.0.4, object-assign@2.1.1, is-stream@1.1.0, infinity-agent@2.0.3, statuses@1.3.1, nested-error-stacks@1.0.2, read-all-stream@2.2.0, duplexify@3.5.0)
├── matched@0.4.4 (fs-exists-sync@0.1.0, is-valid-glob@0.3.0, arr-union@3.1.0, async-array-reduce@0.2.1, extend-shallow@2.0.1, has-glob@0.1.1, glob@7.1.1, lazy-cache@2.0.2, resolve-dir@0.1.1)
├── winston@1.0.2 (cycle@1.0.3, stack-trace@0.0.9, eyes@0.1.8, isstream@0.1.2, async@1.0.0, pkginfo@0.3.1, colors@1.0.3)
├── restler@3.4.0 (yaml@0.2.3, qs@1.2.0, iconv-lite@0.2.11, xml2js@0.4.0)
├── lodash@3.10.1
├── xml2js@0.4.17 (sax@1.2.2, xmlbuilder@4.2.1)
├── requestretry@1.12.0 (extend@3.0.0, when@3.7.8, request@2.80.0, lodash@4.17.4)
└── crossref@0.1.2 (got@5.1.0, request@2.65.0)
What is going on here? Can you please shed some light?
Issue Analytics
- Created 7 years ago
- Comments:26 (12 by maintainers)
Top GitHub Comments
PMC5293196.pdf fulltext.pdf
I attach the two versions (“good” and “bad”) for PMC5293196.
PMC5293196.pdf: the good one, i.e. the one you get with curl. fulltext.pdf: the bad one, i.e. the one you get with getpapers.
If you open the two files in a pure text editor (say, in vi), you will notice that the differences are in parts like stream objects, i.e. in the binary parts. This explains why the PLOS icon does not appear in okular with the bad file. It is something that has to do with the encoding of the binary blobs of a PDF file.
Besides, the bad file is almost double in size, indicating that its binary parts have been encoded in a “bloating” way - base64?
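One mechanism that would produce both symptoms (damage only in the binary streams, plus a much larger file) is the response body being decoded as UTF-8 text and then written back out. This is an assumption on my side, not something confirmed in this thread, and the byte values below are made up for illustration. The sketch shows that such a round trip leaves the ASCII parts intact while changing and inflating the binary bytes.

```javascript
// Illustrative bytes: the ASCII marker "%PDF-" followed by values that are
// typical of compressed stream data and are not valid UTF-8 on their own.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x2d, 0x89, 0xff, 0xd8, 0x00, 0xc3]);

// Treat the bytes as UTF-8 text (invalid sequences become U+FFFD),
// then encode that text back to bytes, as writing a string to disk would.
const asText = original.toString('utf8');
const roundTripped = Buffer.from(asText, 'utf8');

console.log('original length:    ', original.length);
console.log('round-trip length:  ', roundTripped.length);
console.log('identical bytes:    ', original.equals(roundTripped));
console.log('ASCII header intact:', roundTripped.slice(0, 5).toString('latin1') === '%PDF-');
```

Each byte that cannot be decoded is replaced by a three-byte replacement character, so the output grows and the original values cannot be recovered. Keeping the body as a Buffer from download to disk avoids the problem entirely.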
I think I’m pretty close… Now it’s your turn, dear @blahah! 😃
@blahah, indeed, it works! 👍 Something new to learn, thank you! 😃
I derived my version from requestretry documentation, where it says:
From that I gathered that the form
requestretry.get(url, options, something)
should be a valid one. Obviously, for something=‘fullResponse: false’ it is NOT (and I admit that ‘something’ like that does not look like a callback 😉).
Curiously, I did not get an error. It just failed in a subtle way: it downloaded the PDF without picking up the options object, resulting in a corrupt PDF. For someone like me who is used to getting compiler warnings for hair-splitting ‘issues’, this is more than weird…
@tarrow, I suggest you use blahah’s version. It looks cleaner to me to have an options object where all options are collected and pass that one to requestretry, instead of giving it extra options for user agent, encoding etc. each time it is called. Other than that, both versions resolve the bug for me.
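To illustrate the single-options-object approach, here is a rough sketch. It is not the code from either proposed fix: the helper name `buildDownloadOptions` and the User-Agent value are invented, and I am assuming that the relevant settings are `encoding: null` (documented by the `request` library as returning the body as a Buffer) and `fullResponse: false` (the requestretry option discussed above).

```javascript
// Hypothetical helper: collect every download setting in one object, so the
// call site passes a single argument and nothing can be silently dropped.
function buildDownloadOptions(url) {
  return {
    url: url,
    encoding: null,      // request option: keep the body as raw bytes (Buffer)
    fullResponse: false, // requestretry option: resolve with the body only
    headers: { 'User-Agent': 'getpapers' }, // illustrative value, an assumption
  };
}

const options = buildDownloadOptions('http://europepmc.org/articles/PMC5293196?pdf=render');
console.log(options);

// Assumed call shape, not executed here (needs `npm install requestretry`):
// const requestretry = require('requestretry');
// requestretry(options).then(function (body) {
//   require('fs').writeFileSync('fulltext.pdf', body); // body is a Buffer
// });
```

With everything in one object there is no third positional argument to be mistaken for a callback, which is the kind of silent failure described above.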
Thank you all. 😃