PDF corrupt when downloaded with getpapers, O.K. when downloaded directly

I am totally new to getpapers, so this may be just a misunderstanding on my part of some of the inner workings of the ContentMine toolchain (even though it seems to be pretty straightforward…).

Here is the problem: when I use getpapers to download a paper, say with the command:

getpapers --query 'PMCID:"PMC5293196" JOURNAL:"PLOS ONE"' --outdir network-analysis -p

I get:

info: Searching using eupmc API
info: Found 1 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (1/1) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!

which looks pretty good to me (the warnings seem to be harmless in my case). Everything is there and the PDF seems to be in

PMC5293196/fulltext.pdf

However, when I try to open it in acroread, the reader tells me that the file is corrupt, that some font is missing etc. - and the paper is practically unreadable!

If I try to download it directly from the URL

http://europepmc.org/articles/PMC5293196?pdf=render

which is found in the eupmc_results.json as the download URL for this ID, all is OK and the paper is displayed with absolutely no errors…

There is indeed a difference in the MD5 sums of the two versions:

md5sum PMC5293196/fulltext.pdf 
b815dff20f24ddbb035b4bfc3743d7f5  PMC5293196/fulltext.pdf
md5sum PMC5293196.pdf
5771a39765664bedb8ad8815b7980921  PMC5293196.pdf

The latter (PMC5293196.pdf) is the ‘good’ one, downloaded directly from http://europepmc.org/articles/PMC5293196?pdf=render with curl, the former (PMC5293196/fulltext.pdf) is the ‘bad’ one downloaded by getpapers.
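The same comparison can be scripted in Node.js, which also makes it easy to compare file sizes. This is only a minimal sketch: `md5Hex` and `compareDownloads` are hypothetical helper names, not part of getpapers, and the paths in the usage comment are the ones from the report above.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical helper: MD5 hex digest of a Buffer (the value md5sum prints).
function md5Hex(buffer) {
  return crypto.createHash('md5').update(buffer).digest('hex');
}

// Hypothetical helper: summarise how two downloads of the same PDF differ.
function compareDownloads(goodPath, badPath) {
  // No encoding argument, so readFileSync returns the raw bytes as a Buffer.
  const good = fs.readFileSync(goodPath);
  const bad = fs.readFileSync(badPath);
  return {
    goodMd5: md5Hex(good),
    badMd5: md5Hex(bad),
    goodSize: good.length,
    badSize: bad.length,
    identical: good.equals(bad)
  };
}

// Usage:
// console.log(compareDownloads('PMC5293196.pdf', 'PMC5293196/fulltext.pdf'));
```

Reading both files as Buffers matters here: reading them as text would alter the very bytes being compared.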

It does not seem to be a locale issue, because the same happens to a user with

LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE=C

and to a user with a non-US locale.

I have nodejs 4.4.6 and getpapers is installed with

npm install --global getpapers
npm WARN deprecated node-uuid@1.4.7: use uuid module instead
npm WARN deprecated tough-cookie@2.2.2: ReDoS vulnerability parsing Set-Cookie https://nodesecurity.io/advisories/130
/usr/bin/getpapers -> /usr/lib/node_modules/getpapers/bin/getpapers.js
getpapers@0.4.12 /usr/lib/node_modules/getpapers
├── progress@1.1.8
├── version_compare@0.0.3
├── commander@2.7.1 (graceful-readlink@1.0.1)
├── chalk@1.0.0 (escape-string-regexp@1.0.5, ansi-styles@2.2.1, supports-color@1.3.1, strip-ansi@2.0.1, has-ansi@1.0.3)
├── mkdirp@0.5.1 (minimist@0.0.8)
├── sanitize-filename@1.6.1 (truncate-utf8-bytes@1.0.2)
├── got@2.9.2 (lowercase-keys@1.0.0, timed-out@2.0.0, prepend-http@1.0.4, object-assign@2.1.1, is-stream@1.1.0, infinity-agent@2.0.3, statuses@1.3.1, nested-error-stacks@1.0.2, read-all-stream@2.2.0, duplexify@3.5.0)
├── matched@0.4.4 (fs-exists-sync@0.1.0, is-valid-glob@0.3.0, arr-union@3.1.0, async-array-reduce@0.2.1, extend-shallow@2.0.1, has-glob@0.1.1, glob@7.1.1, lazy-cache@2.0.2, resolve-dir@0.1.1)
├── winston@1.0.2 (cycle@1.0.3, stack-trace@0.0.9, eyes@0.1.8, isstream@0.1.2, async@1.0.0, pkginfo@0.3.1, colors@1.0.3)
├── restler@3.4.0 (yaml@0.2.3, qs@1.2.0, iconv-lite@0.2.11, xml2js@0.4.0)
├── lodash@3.10.1
├── xml2js@0.4.17 (sax@1.2.2, xmlbuilder@4.2.1)
├── requestretry@1.12.0 (extend@3.0.0, when@3.7.8, request@2.80.0, lodash@4.17.4)
└── crossref@0.1.2 (got@5.1.0, request@2.65.0)

What is going on here? Can you please shed some light?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 26 (12 by maintainers)

Top GitHub Comments

sedimentation-fault commented, Apr 9, 2017

PMC5293196.pdf fulltext.pdf

I attach the two versions (“good” and “bad”) for PMC5293196.

PMC5293196.pdf: the good one, i.e. the one you get with curl. fulltext.pdf: the bad one, i.e. the one you get with getpapers.

If you open the two files in a pure text editor (say, in vi), you will notice that the differences are in parts like stream objects, i.e. in the binary parts. This explains why the PLOS icon does not appear in okular with the bad file. It is something that has to do with the encoding of the binary blobs of a PDF file.

Besides, the bad file is almost double in size, indicating that its binary parts have been encoded in a “bloating” way - base64?
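One plausible explanation for damage confined to the binary streams, together with a much larger file, is that the response body was decoded as UTF-8 text and then written back out. That this is what happened inside getpapers is an assumption, although it fits the later remark in this thread that the options (including the encoding) never reached the request. Base64 would grow the data by roughly one third, whereas every byte that is not valid UTF-8 turns into a three-byte replacement character. A small sketch with made-up sample bytes (`roundTripAsUtf8` is a hypothetical name):

```javascript
// Decode bytes as UTF-8 text and encode the text again, which is what
// happens when binary data is passed through a string on its way to disk.
function roundTripAsUtf8(bytes) {
  return Buffer.from(bytes.toString('utf8'), 'utf8');
}

// Made-up sample: "%PDF" followed by four bytes that are not valid UTF-8.
const sample = Buffer.from([0x25, 0x50, 0x44, 0x46, 0x80, 0xff, 0xfe, 0x90]);
const damaged = roundTripAsUtf8(sample);

// The ASCII part survives; the invalid bytes are replaced, so the copy grows.
console.log(sample.length, damaged.length);
console.log(damaged.equals(sample)); // false

// Pure ASCII content, such as the textual parts of a PDF, is unaffected.
const ascii = Buffer.from('%PDF-1.4');
console.log(roundTripAsUtf8(ascii).equals(ascii)); // true
```

This would also explain why the text-like parts of the two files look the same in an editor while the stream objects differ.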

I think I’m pretty close…Now it’s your turn, dear @blahah ! 😃

sedimentation-fault commented, Apr 13, 2017

@blahah , Indeed, it works! 👍 Something new to learn, thank you! 😃

I derived my version from requestretry documentation, where it says:

request.get(url, options [, callback]) - same as request(options [, callback]), defaults options.method to GET.

From that I gathered that the form

requestretry.get(url, options, something)

should be a valid one. Obviously, for something=‘fullResponse: false’ it is NOT (and I admit that ‘something’ like that does not look like a callback 😉).

Curiously, I did not get an error - it just failed in a subtle way: it downloaded the PDF without picking up the options object, resulting in a corrupt PDF. For someone like me who is used to getting compiler warnings for hair-splitting ‘issues’, this is more than weird…

@tarrow, I suggest you use blahah’s version. It looks cleaner to me to have an options object where all options are collected and pass that one to requestretry, instead of giving it extra options for user agent, encoding etc. each time it is called. Other than that, both versions resolve the bug for me.
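The "single options object" style can be sketched roughly as follows. This is not the actual patch from either maintainer: `buildPdfRequestOptions` and `downloadPdf` are hypothetical names, the User-Agent value is a placeholder, and `requestFn` stands in for requestretry (or request) called in the `request(options, callback)` form quoted above. The one detail taken from the request library itself is `encoding: null`, which makes it hand back the body as a Buffer instead of a decoded string.

```javascript
// Hypothetical helper: collect every option for a PDF download in one object.
function buildPdfRequestOptions(url) {
  return {
    url: url,
    encoding: null, // request option: return the body as a raw Buffer
    headers: { 'User-Agent': 'getpapers-example' } // placeholder value
  };
}

// Hypothetical helper: `requestFn` is assumed to follow the
// request(options, callback) convention, e.g. require('requestretry').
function downloadPdf(requestFn, url, done) {
  requestFn(buildPdfRequestOptions(url), (err, response, body) => {
    if (err) return done(err);
    // Fail loudly if the body arrived as text, instead of writing a bad file.
    if (!Buffer.isBuffer(body)) {
      return done(new Error('Expected a Buffer body; was encoding: null passed?'));
    }
    done(null, body);
  });
}

// Usage (assumes requestretry is installed):
// downloadPdf(require('requestretry'), 'http://europepmc.org/articles/PMC5293196?pdf=render',
//   (err, pdf) => { if (err) throw err; require('fs').writeFileSync('fulltext.pdf', pdf); });
```

The Buffer check is a design choice aimed at the silent failure described above: if the options are ever dropped again, the download reports an error rather than producing a corrupt PDF.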

Thank you all. 😃
