Feature request: address the 429 "Too Many Requests" errors
Sites like Wikipedia throttle incoming requests, yielding a 429 error.

It may or may not be a broken link.
Idea: what if we improved the algorithm to slow down for 429-throttling domains and tackled all 429 links in a separate, second round, per domain but slower?
Imagine: outgoing requests go out as normal, respecting the `--concurrency` value, but once the run completes, the checker extracts all 429 errors, groups them per throttling domain, waits a bit, then slowly retries each link at, say, 1 request per second (or slower), with the throttling domains handled concurrently.
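A minimal sketch of how that second round could look, assuming a flat list of results and a generic `check` callback (all names here are illustrative, not linkinator's actual API):

```ts
// Hypothetical second-round retry: group 429s by host, then re-check each
// host's links slowly (about one per second), with different hosts in parallel.
// All names here are illustrative, not part of linkinator.
interface LinkResult {
  url: string;
  status: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryThrottled(
  results: LinkResult[],
  check: (url: string) => Promise<LinkResult>,
  delayMs = 1000, // ~1 request per second per throttling domain
): Promise<LinkResult[]> {
  // Group the 429 responses by host.
  const byHost = new Map<string, LinkResult[]>();
  for (const result of results.filter((r) => r.status === 429)) {
    const host = new URL(result.url).host;
    byHost.set(host, [...(byHost.get(host) ?? []), result]);
  }

  // Drain each host sequentially and slowly; the hosts themselves run concurrently.
  const retried = await Promise.all(
    [...byHost.values()].map(async (links) => {
      const out: LinkResult[] = [];
      for (const link of links) {
        await sleep(delayMs);
        out.push(await check(link.url));
      }
      return out;
    }),
  );
  return retried.flat();
}
```

The main run would still honor `--concurrency`; only the leftover 429 links pay the slow per-domain price.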
Currently…
For example, I've got 1042 links and 11 of them link to Wikipedia. If I set `--concurrency` to satisfy Wikipedia, say 2 seconds per request, it will take 1042 × 2 / 60 ≈ 35 minutes, which is unbearable considering it's for 1% of the links!
If we implemented the feature, it would be 1031 × 0.01 + 11 × 2 ≈ 32 seconds. Reasonable, considering the current ~100 req/sec pace takes 1042 × 0.01 ≈ 10 seconds.
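For the curious, here are the back-of-the-envelope numbers above as a tiny script (the 0.01 s and 2 s per-request paces are assumptions taken from this example):

```ts
// Rough timing comparison for the example above.
const total = 1042;
const throttled = 11;     // links pointing at Wikipedia
const normalPace = 0.01;  // seconds per request at the usual concurrency
const slowPace = 2;       // seconds per request that keeps Wikipedia happy

const everythingSlow = (total * slowPace) / 60;                           // ≈ 34.7 minutes
const twoPhase = (total - throttled) * normalPace + throttled * slowPace; // ≈ 32.3 seconds
console.log({ everythingSlow, twoPhase });
```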
I can exclude Wikipedia via `--skip`, but we can automate this, can't we?
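For reference, the manual workaround today looks something like this (the exact pattern is only an example):

```
linkinator ./dist --recurse --skip "wikipedia\.org"
```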
What do you think?
PS. github / npm / wikipedia link checks do work on the latest; I'm using linkinator as `linkinator ./dist --recurse --concurrency 1`; if anybody is still having 429 problems, limit the concurrency to one. Thank you Justin! 👍

It's a really good point - right now the crawler is a tad aggressive 😃 Another potential idea on how to handle this one - I suspect most services that return an HTTP 429 may also return a `Retry-After` header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

When that header is detected we could add the request to a queue that is specific to the subdomain, and then drain it in accordance with the retry guidance coming back from results. It sounds like a lot of fun to build 😃
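A rough sketch of that queue idea, purely hypothetical (none of these names exist in linkinator; `Response` is the standard fetch response type):

```ts
// Hypothetical per-host queue that drains according to Retry-After.
// None of these names come from linkinator; this is only the shape of the idea.
type Check = (url: string) => Promise<Response>;

class RetryAfterQueue {
  private pending = new Map<string, string[]>();
  private draining = new Set<string>();

  constructor(
    private check: Check,
    private onResult: (url: string, res: Response) => void,
  ) {}

  // Called when a crawl result comes back as HTTP 429.
  requeue(url: string, retryAfter: string | null): void {
    const host = new URL(url).host;
    const queue = this.pending.get(host) ?? [];
    queue.push(url);
    this.pending.set(host, queue);
    if (!this.draining.has(host)) {
      this.draining.add(host);
      setTimeout(() => void this.drain(host), this.toMs(retryAfter));
    }
  }

  private toMs(retryAfter: string | null): number {
    if (!retryAfter) return 1000; // fallback pace when no header is present
    const seconds = Number(retryAfter);
    if (!Number.isNaN(seconds)) return seconds * 1000;
    // Retry-After can also be an HTTP date.
    return Math.max(0, new Date(retryAfter).getTime() - Date.now());
  }

  private async drain(host: string): Promise<void> {
    const queue = this.pending.get(host) ?? [];
    const url = queue.shift();
    if (!url) {
      this.draining.delete(host);
      return;
    }
    const res = await this.check(url);
    this.onResult(url, res);
    // Schedule the next request for this host using the latest guidance.
    setTimeout(() => void this.drain(host), this.toMs(res.headers.get('retry-after')));
  }
}
```

Draining per host means one slow domain never holds up the rest of the crawl.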