Intermittent 404 errors to client due to upstream DNS lookup failure (getaddrinfo EAI_AGAIN)

Describe the bug

We see intermittent/random 404 responses coming from Verdaccio back to the NPM client for packages that are proxied from registry.npmjs.org.

When this happens, the Verdaccio logs look similar to this, but the particular packages that have errors change each time:

[2019-09-26 20:45:10]  info <-- 127.0.0.1 requested 'GET /is-promise'
[2019-09-26 20:45:10]  info --> making request: 'GET https://registry.npmjs.org/is-promise'
[2019-09-26 20:45:10]  http --> ERR, req: 'GET https://registry.npmjs.org/is-promise', error: getaddrinfo EAI_AGAIN registry.npmjs.org registry.npmjs.org:443
[2019-09-26 20:45:10]  http <-- 404, user: null(127.0.0.1), req: 'GET /is-promise', error: no such package available

Only a small fraction of the total requests fail, with many other requests within the same second (both before and after the failed request) completing successfully (including DNS resolution to registry.npmjs.org).

I believe there are two related issues here:

  1. An EAI_AGAIN failure from getaddrinfo is a transient error, so Verdaccio should respond with a 5xx HTTP error (probably 503) to the client, not a 404 error, so that npm/yarn will retry the failure instead of immediately terminating with an error.

  2. I believe that Verdaccio may be indirectly causing the getaddrinfo EAI_AGAIN error because it makes a very large number of calls to the Node.js DNS library in a very short period of time, which may be triggering some (still unknown) resource issue. Examples of this type of resource issue include a limit within Node.js (such as too few libuv threads for DNS lookups), a process-level limit (such as too many open file descriptors), or a container or system issue (such as ephemeral port exhaustion for DNS queries).
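
The first point can be sketched as a small status-mapping helper. This is a minimal illustration, not Verdaccio's actual code: the function name `statusForUplinkFailure`, the shape of its argument, and the set of error codes treated as transient are all assumptions. Only `EAI_AGAIN` comes from the logs above.

```javascript
// Hypothetical helper (not part of Verdaccio): pick the HTTP status to return
// to the npm/yarn client when a request to the uplink did not succeed.

// Node.js error codes treated as transient here. Only EAI_AGAIN appears in
// the logs above; the other two are assumptions added for illustration.
const TRANSIENT_ERROR_CODES = new Set(["EAI_AGAIN", "ETIMEDOUT", "ECONNRESET"]);

function statusForUplinkFailure(failure) {
  // A network-level error means the uplink never answered the request.
  if (failure.errorCode) {
    return TRANSIENT_ERROR_CODES.has(failure.errorCode) ? 503 : 502;
  }
  // The uplink answered and reported that the package does not exist.
  if (failure.upstreamStatus === 404) {
    return 404;
  }
  // Any other upstream failure is reported as a bad gateway.
  return 502;
}

console.log(statusForUplinkFailure({ errorCode: "EAI_AGAIN" })); // 503
console.log(statusForUplinkFailure({ upstreamStatus: 404 }));    // 404
```

With a mapping like this, a 404 would be returned only when the uplink itself says the package is missing, so a client that retries on 5xx responses would get a chance to recover from a temporary DNS failure.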

To Reproduce

We have seen this very consistently in our CI runs for AdaptJS. But CI for Adapt is fairly complicated, so isolating this to a simple set of steps has not been possible. However, because it is easy to reproduce in CI, we can easily test possible fixes.

Some info about the Adapt CI process and the error that might be useful:

  • Verdaccio version 4.3.1
  • All the CI tests run from inside a single Docker container and are driven by Mocha.
  • Verdaccio is started from Mocha, inside that same container (not in its own container).
  • We run sets of tests in parallel to shorten CI runtime. This means that the system load (CPU, disk I/O, and network I/O) tends to be quite high during testing, but there is some variability and timing differences from run to run.
  • CI always starts with empty NPM and yarn caches and empty Verdaccio storage.
  • We publish private versions of the Adapt packages (namespace @adpt) to Verdaccio and proxy everything else to registry.npmjs.org.
  • The test that consistently fails is the first time we do a global install of the Adapt CLI package inside the container (‘npm install -g --registry http://localhost:PORT @adpt/cli’ where PORT is a dynamically chosen port for Verdaccio). This is also the first time that any NPM install is done, so the NPM cache is still empty and Verdaccio’s storage is empty, which results in Verdaccio fetching a large number of packages from the public NPM registry all at once.

Expected behavior

  1. Clients (NPM/yarn) should receive HTTP 5xx errors due to DNS EAI_AGAIN errors received while looking up the upstream registry address, not 404 errors.

  2. Verdaccio should not cause excessively high rates of DNS queries for a single upstream host.

Screenshots

Complete Verdaccio log file, showing the 404 errors: local-registry-20190926-204410.531.log

Because this is reproducible, I can supply as many similar log files as needed.

Docker || Kubernetes (please complete the following information):

  • Docker verdaccio tag: N/A
  • Docker commands: N/A
  • Docker Version: 18.06.1-ce (Server) 19.03.2 (Client)

Configuration File (cat ~/.config/verdaccio/config.yaml)

We’re using the API, so here’s the JS object:

{
    storage: <dynamically created directory just for this verdaccio instance>,
    auth: {
        htpasswd: {
            file: path.join(verdaccioDir, "htpasswd")
        }
    },
    uplinks: {
        npmjs: {
            url: "https://registry.npmjs.org/",
            max_fails: 20,
            timeout: "5s",
            fail_timeout: "1s",
        }
    },
    packages: {
        "@adpt/*": {
            access: "$all",
            publish: "$all",
        },
        "**": {
            access: "$all",
            publish: "$all",
            proxy: "npmjs"
        },
    },
    logs: [
        { type: "file", format: "pretty-timestamped", level: "debug", path: <dynamic log filename>}
    ],
}

Environment information

Verdaccio version: 4.3.1

Environment Info:

  System:
    OS: Linux 4.14 Debian GNU/Linux 9 (stretch) 9 (stretch)
    CPU: (4) x64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
  Binaries:
    Node: 10.15.3 - /usr/local/bin/node
    Yarn: 1.13.0 - /usr/local/bin/yarn
    npm: 6.4.1 - /usr/local/bin/npm
  Virtualization:
    Docker: 19.03.2 - /usr/bin/docker

Debugging output

See attached log file for complete logs. Important log details excerpted above.

Additional context

After quite a bit of troubleshooting and debugging, I believe that Verdaccio is indirectly causing a resource issue due to how it handles requests to the upstream server, by opening a new TCP connection (and doing a DNS lookup) for every upstream request individually.

The fix I’d like to propose is to use HTTP keepalive on the upstream requests, which will re-use those TCP connections to the upstream server multiple times, thus reducing the number of DNS lookups (and TCP setups and TLS negotiations) that Verdaccio does.

In our testing, enabling HTTP keepalive for upstream requests in Verdaccio resolves this issue. I’d be happy to submit a PR for this, if you’d like.
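
A minimal sketch of the idea, using only Node's built-in `https` module. It does not show how Verdaccio issues its upstream requests: the helper name `fetchMetadata` and the socket limits are assumptions chosen for illustration.

```javascript
const https = require("https");

// One shared agent for all upstream requests. With keepAlive enabled, idle
// sockets stay open and are reused, so a reused connection needs no new DNS
// lookup, TCP handshake, or TLS negotiation.
// The numeric limits are illustrative assumptions, not recommended values.
const keepAliveAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 40,     // cap on concurrent sockets per host
  maxFreeSockets: 10, // cap on idle sockets kept open per host
});

// Hypothetical helper: fetch package metadata through the shared agent.
function fetchMetadata(packageName) {
  return new Promise((resolve, reject) => {
    const url = `https://registry.npmjs.org/${encodeURIComponent(packageName)}`;
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => { body += chunk; });
        res.on("end", () => resolve({ status: res.statusCode, body }));
      })
      .on("error", reject); // e.g. an error whose code is EAI_AGAIN
  });
}
```

Calling `fetchMetadata("is-promise")` many times in a row would then share a bounded pool of connections instead of opening a new one per request.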

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

mterrel commented, Oct 8, 2019 (1 reaction)

I was just looking at this week’s commits to master. It looks like agent options are now supported in 4.3.3, via #1332. That’s great! 😄

So that should address the request for adding HTTP keepalive. I will do some testing with 4.3.3 to confirm.

The issue with returning 404 in response to the transient error EAI_AGAIN still remains, as far as I know.
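
For readers on a version with agent options support, the uplink from the configuration object in the issue might be extended along these lines. This is a sketch under assumptions: the key name `agent_options` and the numeric limits are not taken from this thread and should be checked against the documentation for the Verdaccio version in use.

```javascript
// Sketch of the uplinks section from the issue's configuration with
// connection reuse enabled. `agent_options` and the numeric limits are
// assumptions; verify them against the Verdaccio documentation.
const uplinks = {
  npmjs: {
    url: "https://registry.npmjs.org/",
    max_fails: 20,
    timeout: "5s",
    fail_timeout: "1s",
    agent_options: {
      keepAlive: true,    // reuse TCP connections to the uplink
      maxSockets: 40,     // illustrative limit
      maxFreeSockets: 10, // illustrative limit
    },
  },
};
```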

juanpicado commented, Dec 8, 2020 (0 reactions)

I’ve enabled keep-alive via configuration by default. In the next major release, v4.11.0, in late December, I’ll enable it via code if nobody reports any issues (I don’t expect any, but I’m just being careful) 👍
