Intermittent 404 errors to client due to upstream DNS lookup failure (getaddrinfo EAI_AGAIN)
Describe the bug
We see intermittent/random 404 responses coming from Verdaccio back to the NPM client for packages that are proxied from registry.npmjs.org.
When this happens, the Verdaccio logs look similar to this, but the particular packages that have errors change each time:
[2019-09-26 20:45:10] info <-- 127.0.0.1 requested 'GET /is-promise'
[2019-09-26 20:45:10] info --> making request: 'GET https://registry.npmjs.org/is-promise'
[2019-09-26 20:45:10] http --> ERR, req: 'GET https://registry.npmjs.org/is-promise', error: getaddrinfo EAI_AGAIN registry.npmjs.org registry.npmjs.org:443
[2019-09-26 20:45:10] http <-- 404, user: null(127.0.0.1), req: 'GET /is-promise', error: no such package available
Only a small fraction of the total requests fail, with many other requests within the same second (both before and after the failed request) completing successfully (including DNS resolution to registry.npmjs.org).
I believe there are two related issues here:
- An EAI_AGAIN failure from getaddrinfo is a transient error, so Verdaccio should respond with a 5xx HTTP error (probably 503) to the client rather than a 404, so that npm/yarn will retry the request instead of immediately terminating with an error.
- I believe that Verdaccio may be indirectly causing the getaddrinfo EAI_AGAIN error because it makes a very large number of calls to the Node.js DNS library in a very short period of time, which may be exhausting some (still unknown) resource. (Examples of this type of resource issue could be within Node.js itself, like too few libuv threads for DNS lookups; a process-level limit, like too many open file descriptors; or a container- or system-level issue, like ephemeral port exhaustion for DNS queries.)
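To illustrate the first point, here is a minimal sketch (not Verdaccio's actual code; the helper name `statusForUpstreamError` is hypothetical) of mapping transient upstream network errors to a retryable 5xx status instead of a misleading 404:

```javascript
// Sketch: translate transient network errors from an upstream request into a
// retryable 503 instead of 404. The error codes are standard Node.js/libuv
// codes; which codes count as "transient" here is my assumption.
const TRANSIENT_CODES = new Set([
  "EAI_AGAIN",  // temporary DNS resolution failure
  "ECONNRESET", // connection reset by the upstream
  "ETIMEDOUT",  // connect/read timeout
]);

function statusForUpstreamError(err) {
  if (err && TRANSIENT_CODES.has(err.code)) {
    return 503; // Service Unavailable: tells npm/yarn the failure is retryable
  }
  return 404; // treat everything else as "no such package"
}
```

With this mapping, the failed `GET /is-promise` request in the log above would have produced a 503, which npm and yarn retry, rather than a terminal 404.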
To Reproduce
We have seen this very consistently in our CI runs for AdaptJS. But CI for Adapt is fairly complicated, so isolating this to a simple set of steps has not been possible. However, because it is easy to reproduce in CI, we can easily test possible fixes.
Some info about the Adapt CI process and the error that might be useful:
- Verdaccio version 4.3.1
- All the CI tests run from inside a single Docker container and are driven by Mocha.
- Verdaccio is started from Mocha, inside that same container (not in its own container).
- We run sets of tests in parallel to shorten CI runtime. This means that the system load (CPU, disk I/O, and network I/O) tends to be quite high during testing, but there is some variability and timing differences from run to run.
- CI always starts with empty NPM and yarn caches and empty Verdaccio storage.
- We publish private versions of the Adapt packages (namespace @adpt) to Verdaccio and proxy everything else to registry.npmjs.org
- The test that consistently fails is the first time we do a global install of the Adapt CLI package inside the container ('npm install -g --registry http://localhost:PORT @adpt/cli' where PORT is a dynamically chosen port for Verdaccio). This is also the first time that any NPM install is done, so the NPM cache is still empty and Verdaccio's storage is empty, which results in Verdaccio fetching a large number of packages from the public NPM registry all at once.
Expected behavior
- Clients (npm/yarn) should receive HTTP 5xx errors when DNS EAI_AGAIN errors occur while looking up the upstream registry address, not 404 errors.
- Verdaccio should not cause excessively high rates of DNS queries for a single upstream host.
Screenshots
Complete Verdaccio log file, showing the 404 errors: local-registry-20190926-204410.531.log
Because this is reproducible, I can supply as many similar log files as needed.
Docker || Kubernetes (please complete the following information):
- Docker verdaccio tag: N/A
- Docker commands N/A
- Docker Version: 18.06.1-ce (Server) 19.03.2 (Client)
Configuration File (cat ~/.config/verdaccio/config.yaml)
We’re using the API, so here’s the JS object:
{
storage: <dynamically created directory just for this verdaccio instance>,
auth: {
htpasswd: {
file: path.join(verdaccioDir, "htpasswd")
}
},
uplinks: {
npmjs: {
url: "https://registry.npmjs.org/",
max_fails: 20,
timeout: "5s",
fail_timeout: "1s",
}
},
packages: {
"@adpt/*": {
access: "$all",
publish: "$all",
},
"**": {
access: "$all",
publish: "$all",
proxy: "npmjs"
},
},
logs: [
{ type: "file", format: "pretty-timestamped", level: "debug", path: <dynamic log filename>}
],
}
Environment information
Verdaccio version: 4.3.1
Environment Info:
System:
OS: Linux 4.14 Debian GNU/Linux 9 (stretch) 9 (stretch)
CPU: (4) x64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Binaries:
Node: 10.15.3 - /usr/local/bin/node
Yarn: 1.13.0 - /usr/local/bin/yarn
npm: 6.4.1 - /usr/local/bin/npm
Virtualization:
Docker: 19.03.2 - /usr/bin/docker
Debugging output
See attached log file for complete logs. Important log details excerpted above.
Additional context
After quite a bit of troubleshooting and debugging, I believe that Verdaccio is indirectly causing a resource issue due to how it handles requests to the upstream server, by opening a new TCP connection (and doing a DNS lookup) for every upstream request individually.
The fix I’d like to propose is to use HTTP keepalive on the upstream requests, which will re-use those TCP connections to the upstream server multiple times, thus reducing the number of DNS lookups (and TCP setups and TLS negotiations) that Verdaccio does.
In our testing, enabling HTTP keepalive for upstream requests in Verdaccio resolves this issue. I’d be happy to submit a PR for this, if you’d like.
Top GitHub Comments
I was just looking at this week’s commits to master. It looks like agent options are now supported in 4.3.3, via #1332. That’s great! 😄
So that should address the request for adding HTTP keepalive. I will do some testing with 4.3.3 to confirm.
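If I'm reading that change correctly, the agent options would be forwarded through the uplink configuration, something like this (the `agent_options` key name and values are my assumption from the PR; verify against the Verdaccio docs for your version):

```javascript
// Assumed uplink configuration shape: pass Node http(s).Agent options
// (keepAlive, socket limits) through to the upstream agent.
const uplinks = {
  npmjs: {
    url: "https://registry.npmjs.org/",
    agent_options: {
      keepAlive: true,
      maxSockets: 40,
      maxFreeSockets: 10,
    },
  },
};
```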
The issue with returning 404 in response to the transient error EAI_AGAIN still remains, as far as I know.
I’ve enabled keep-alive via configuration by default. At the next major, v4.11.0, in late December, I’ll enable it via code if nobody reports any issues (I don’t expect any, but just being careful) 👍