Unexpected APM Server response when polling config
Describe the bug
We have a problem with our ES cluster, resulting in this error:
```
Error: Unexpected APM Server response when polling config
    at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)
```
which is obviously our problem. But the error is unhandled and takes down the process. Is that the expected behaviour? I can get over missing APM data, but having my service restart constantly is less than ideal. I don’t want my uptime constrained to that of the APM server.
To Reproduce
Ruin your ES
Expected behavior
Is there any way to handle this kind of error and, e.g., just log it? I found this, but I don’t want to trap all exceptions from the process (we already have a handler for that, to log & exit). Can I do something like:
apm.on("error", err => { log(err); })
Am I missing something obvious? Thanks.
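For context, the blanket workaround being avoided would presumably look something like this (a sketch; `log` is a stand-in for whatever logger is in use):

```js
// Traps *every* uncaught exception in the process, not just the agent's,
// which is exactly what the question is trying to avoid.
process.on('uncaughtException', (err) => {
  log(err)  // hypothetical logger
})
```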
Environment (please complete the following information)
- OS: Debian 9
- Node.js version: 12.16.3 (LTS)
- APM Server version: 7.6.1
- Agent version: 3.5
How are you starting the agent? (please tick one of the boxes)
- Calling `agent.start()` directly (e.g. `require('elastic-apm-node').start(...)`)
- Requiring `elastic-apm-node/start` from within the source code
- Starting node with `-r elastic-apm-node/start`
Additional context
Using env vars for config
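For reference, env-var configuration looks roughly like this (a sketch; the variable names are the agent’s documented ones, the values are made up):

```js
// With config coming from the environment, the start call needs no options:
//   ELASTIC_APM_SERVICE_NAME=my-service
//   ELASTIC_APM_SERVER_URL=http://apm-server:8200
//   ELASTIC_APM_SECRET_TOKEN=s3cr3t
require('elastic-apm-node').start()
```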
Top GitHub Comments
I fluked and found a local repro for this.
tl;dr: It only happens with node v12 (for me at least), and only on a second failed attempt to retrieve central config from APM server.
Working from this part of the traceback above:
I was able to match those line numbers to node v12’s version of internal/streams/destroy.js. Using this patch to node_modules/elastic-apm-http-client/index.js (importantly, the part reducing the time until the next config poll), and running a hello-world app using APM with node v12, itself configured to talk to a mock APM server that responds with 503 to central config requests, resulted in the crash.
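The mock server itself wasn’t captured above; something along these lines would match the description (a sketch: the `/config/v1/agents` path is APM Server’s central-config endpoint, everything else here is assumed):

```js
// mock-apm-server.js (sketch): answer 503 to central-config polls and
// accept everything else, simulating a broken ES backend behind APM Server.
const http = require('http')

http.createServer((req, res) => {
  if (req.url.startsWith('/config/v1/agents')) {
    res.writeHead(503, { 'Content-Type': 'application/json' })
    res.end('{"error":"elasticsearch unavailable"}')
  } else {
    res.writeHead(202)
    res.end()
  }
}).listen(8200)
```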
I haven’t dug in any further yet.
I finally got back to this… and couldn’t repro for a while because I hadn’t been specific about my “hello world app” before.
better details on a repro
Here are better details on a repro.
Running `node foo.js` -> no crash.
Running `PLZ_REQUIRE_EXPRESS=1 node foo.js` -> crash.
I.e.: just require’ing “express” makes a difference here.
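The foo.js itself wasn’t included in the scrape; a minimal sketch consistent with the description (service name, URL, and keep-alive trigger are assumptions) might be:

```js
// foo.js (hypothetical reconstruction): a hello-world app with the APM
// agent started and pointed at a server whose config endpoint fails.
require('elastic-apm-node').start({
  serviceName: 'foo',                  // assumed
  serverUrl: 'http://localhost:8200',  // e.g. the mock server above
  centralConfig: true                  // the default; polls for config
})

if (process.env.PLZ_REQUIRE_EXPRESS) {
  // Require'ing express triggers agent.setFramework(), which (per the
  // analysis below) switches the client onto its keep-alive http.Agent.
  require('express')
}

// Keep the process alive long enough for a second config poll.
setInterval(() => {}, 10000)
```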
More specific version details for the record:
the issue
Working backwards from the crash point.
The crash is from an unhandled “error” event on a socket instance:
That “error” event is emitted because the agent’s apm-server client calls `res.destroy(err)` after a non-200 response from polling the APM server for config here. `IncomingMessage.destroy([error])` emits “error” on the socket.

So why only on the second poll to APM Server?
The apm-server client creates an `http.Agent` with `keepAlive=true` by default. The intent is to use that keep-alive agent for requests to the APM server. However, due to a bug in #53, the agent (set as `this._agent`) isn’t set until after the request options for apm-server requests are built. This means that `this._agent === undefined` here when the request options are set; in other words, the keep-alive agent isn’t getting used. However, if `client.config(...)` is called (again) on the client after initial creation, then those request options are rebuilt, and this time `this._agent` will be set. (A minimal sketch of this ordering bug follows.)
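A simplified sketch of that ordering (not the actual client source; names are illustrative):

```js
const http = require('http')

class Client {
  constructor (opts) {
    this.config(opts)  // builds request options; this._agent is still
                       // undefined here, so no keep-alive agent is captured
    this._agent = new http.Agent({ keepAlive: true })
  }

  config (opts) {
    // The request options capture whatever this._agent is *right now*.
    this._reqOpts = { agent: this._agent }
  }
}

const client = new Client({})
console.log(client._reqOpts.agent)  // undefined -> default non-keep-alive agent

client.config({})                   // e.g. via agent.setFramework()
console.log(client._reqOpts.agent)  // keep-alive Agent from here on
```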
There is one current case in the Node.js APM Agent where `client.config()` is called again: in `agent.setFramework()`, which is called automatically in these module instrumentations:

The result is that if any of those 5 modules is `require`d in a user app, then the apm-server client will switch to using its keep-alive agent. The apm-server client always immediately polls for central config, so this switch to the keep-alive agent will only occur for the 2nd and subsequent polls.

So why isn’t there an “error” event listener on that res.socket? And why only in node v12? Because of this “TODO” in node core’s http client handling of keep-alive after the “end” of a response: https://github.com/nodejs/node/blob/v12.19.1/lib/_http_client.js#L625-L626
In other words, if you `res.destroy(err)` after the res “end” event in node v12 with a keep-alive agent, you’ll be emitting an “error” event on a socket with no “error” event listener. That results in a crash, or a trip through the uncaughtException handler.

a smaller repro demo
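The bar.js itself didn’t survive the scrape; a sketch of the kind of demo described (the exact timing is an assumption) would be:

```js
// bar.js (hypothetical reconstruction): destroy a keep-alive response with
// an error after its "end" event. On node v12 the error is emitted on a
// socket with no "error" listener, crashing the process.
const http = require('http')

const server = http.createServer((req, res) => {
  res.writeHead(503)
  res.end('oops')
})

server.listen(0, () => {
  const agent = new http.Agent({ keepAlive: true })
  http.get({ port: server.address().port, agent }, (res) => {
    res.resume()
    res.on('end', () => {
      // Wait a tick so the socket has been handed back to the keep-alive
      // pool (where, per the node core TODO above, it has no "error"
      // listener), then destroy the response with an error.
      setImmediate(() => {
        res.destroy(new Error('boom'))
      })
    })
  })
})
```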
Run that “bar.js” with node v12:
Run it with `NODE_DEBUG=http` to see more of what is going on in the node core http client.

I’ll start a fix on Monday.