
HTTP error 429 - Too Many Requests


Although I pledged in https://github.com/ContentMine/getpapers/issues/156 to resist the temptation of opening a new bug for each and every HTTP error I encounter, this one happens so often that it deserves special attention.

What happened so far

In my attempt to avoid the showstopper ECONNRESET error (see https://github.com/ContentMine/getpapers/issues/155), I applied my workaround described in https://github.com/ContentMine/getpapers/issues/152 to let my own curl wrapper do the work:

I commented out the original code in /usr/lib/node_modules/getpapers/lib/download.js:

// //   rq = requestretry.get({url: url,
// //                    fullResponse: false,
// //                    headers: {'User-Agent': config.userAgent},
// //                    encoding: null
// //                   });
//   rq = requestretry.get(Object.assign({url: url, fullResponse: false}, options));
//   rq.then(handleDownload)
//   rq.catch(throwErr)

and appended this:

  // Alternative method: use 'exec' to run 'mycurl -o ...'
  // Compose the mycurl command
  var mycurl = 'mycurl -o \'' + base + rename + '\' \'' + url + '\'';
  log.debug('Executing: ' + mycurl);
  // execute mycurl using child_process's exec function
  var child = exec(mycurl, function(err, stdout, stderr) {
      // if (err) throw err;
      if (err) {
        log.error(err);
      }
      // else console.log(rename + ' downloaded to ' + base);
      else {
        // log.info(stdout);
        console.log(stdout);
        log.debug(rename + ' downloaded to ' + base);
      }
  });
  nextUrlTask(urlQueue);

Here, mycurl is just my own curl wrapper - it catches curl errors and implements various strategies depending on the error, the server, my daily mood and other obscure factors. 😉
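For illustration only, here is a rough Node.js sketch of the kind of retry-and-back-off behaviour mycurl implements. This is not my actual script; the flags mirror the curl command quoted further down, and the attempt count and sleep times are just placeholder assumptions:

'use strict';
var execFileSync = require('child_process').execFileSync;

// Hypothetical stand-in for 'mycurl': run curl and retry with a growing
// pause when it fails (curl exits with code 22 on an HTTP 429 when --fail is used).
function curlWithBackoff(url, outfile, maxAttempts) {
  maxAttempts = maxAttempts || 5;
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      execFileSync('curl', [
        '--location', '--fail', '--progress-bar',
        '--connect-timeout', '100', '--max-time', '300',
        '-C', '-',                                    // resume partial downloads
        '-o', outfile, url
      ], { stdio: 'inherit' });
      return true;                                    // download succeeded
    } catch (err) {
      console.error('Attempt ' + attempt + ' failed (curl exit code ' + err.status + ')');
      if (attempt === maxAttempts) return false;      // give up
      var waitSeconds = Math.pow(2, attempt);         // exponential back-off: 2, 4, 8, ... seconds
      execFileSync('sleep', [String(waitSeconds)]);   // crude synchronous sleep (Unix only)
    }
  }
}

A call like curlWithBackoff(url, base + rename) could then take the place of the mycurl command string in the exec-based snippet above.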

NOTE: You will also need to add something like

// Commented. Has issues with unhandled ECONNRESET errors.
// var requestretry = require('requestretry')
var exec = require('child_process').exec

at the top of download.js.

The problem now

My above ‘hack around’ (as @tarrow calls it in https://github.com/ContentMine/getpapers/issues/152) works smoothly - but every now and then (like every 10 downloads or so), it catches a 429 Too Many Requests error:

curl: (22) The requested URL returned error: 429 Too Many Requests
(curl --location --fail --progress-bar --connect-timeout 100 --max-time 300 -C - -o PMC3747277/fulltext.pdf http://europepmc.org/articles/PMC3747277?pdf=render)

    at ChildProcess.exithandler (child_process.js:206:12)
    at emitTwo (events.js:106:13)
    at ChildProcess.emit (events.js:191:7)
    at maybeClose (internal/child_process.js:877:16)
    at Socket.<anonymous> (internal/child_process.js:334:11)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at Pipe._handle.close [as _onclose] (net.js:498:12)

My curl wrapper catches this and does retry a few times - but it seems that a more elaborate strategy is needed (most notably, a longer sleep interval between retries). The frequency of this error indicates that getpapers is hammering the server too fast.

I have not seen any way to throttle requests from getpapers (the keyword phrase associated with error 429 is “rate limit”). I therefore strongly suggest introducing such an option - otherwise, the user has to run the script multiple times, without knowing for sure whether subsequent runs will correct the failed downloads of previous runs (see https://github.com/ContentMine/getpapers/issues/156).
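For what it's worth, here is one possible shape such an option could take, sketched against the commented-out requestretry call from download.js shown above. maxAttempts, retryDelay and retryStrategy are documented requestretry options; the config.requestDelay name is invented here, and the surrounding variables (url, config, handleDownload, throwErr) are assumed to be the same ones used in the original code:

var requestretry = require('requestretry');

// Retry on network errors and on HTTP 429, so that the longer retryDelay
// gives the server time to lift its rate limit before the next attempt.
function retryOn429OrNetworkError(err, response) {
  return !!err || (response && response.statusCode === 429);
}

rq = requestretry.get({
  url: url,
  fullResponse: false,
  headers: {'User-Agent': config.userAgent},
  encoding: null,
  maxAttempts: 10,                            // retry a bit longer than the default
  retryDelay: config.requestDelay || 30000,   // hypothetical option: pause e.g. 30 s between attempts
  retryStrategy: retryOn429OrNetworkError
});
rq.then(handleDownload);
rq.catch(throwErr);

Pausing between the initial requests themselves (for example, a fixed delay before each call to nextUrlTask(urlQueue)) would be the other half of such a throttle, since retryDelay only spaces out retries of a single failing download.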

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 26 (10 by maintainers)

Top GitHub Comments

sedimentation-fault commented, Apr 20, 2017 (2 reactions)

The synchronous curl-wrapper workaround above works like a charm - it has been running for 1.5 days, has handled all kinds of HTTP errors gracefully, is at 70% and still going! I suggest it to anybody as a temporary (or even permanent) solution to HTTP errors that getpapers cannot (yet) handle, as well as a hack that can give the developers more information about the inner workings of the HTTP connection.

Thank you all for your tips and great help! 👍

petermr commented, Apr 21, 2017 (1 reaction)

Thanks both, I think that directing some of this energy and technology toward quickscrape would be really valuable.
getpapers is a tool to maximize the efficiency of extracting content from willing organizations. There are an increasing number of good players who expose APIs and want people to use them responsibly. (I’ve been on the Project Advisory Board of EuropePMC for 10 years and seen this from the other side - they aim to support high volumes of downloads and we work with them. @tarrow frequently contacts them with problems and they respect this and respond.) Note that I and other members of the ContentMine community have frequent contact with many repositories (arXiv, HAL, CORE, etc.) and work with them to resolve problems. But as @blahah says, it’s underfunded compared with the investment that rich publishers make in non-open systems.

By contrast quickscrape aims to scrape web pages to which the user has legal access (I stress this). Many publishers do not provide an API and some that do have unacceptable terms and conditions. quickscrape has been designed to take a list of URLs (or resolved DOIs) and download the content from a web site. This should only be done when you believe this is legal. The problem is that the sites often use dynamic HTML / Javascript, contain lots of “Publisher Junk” and change frequently. If you have a list of (say) 1000 URLs then it may well contain 50 different publishers. There is a generic scraper which works well for many, but for some it’s necessary to write bespoke scrapers.

A typical (and valuable) use of quickscrape is in conjunction with Crossref (who we are friends with). Crossref contains metadata from publishers (often messy) and the ability to query, but does not itself have the full text. So a typical workflow (which I spent a lot of time running last year) is:

  • query Crossref using getpapers. This returns a list of URLs
  • pass the URLs to quickscrape and download the papers you are legally entitled to.

This is really valuable for papers which are not in a repository. It’s a very messy business as there are frequent “hangs” and unexpected output or none. @tarrow worked hard to improve it but there is still a lot of work to be done.

If you are interested in this PLEASE liaise with @blahah - he wrote it and knows many of the issues.
