
Unexpected APM Server response when polling config

See original GitHub issue

Describe the bug

We have a problem with our ES cluster, resulting in this error:

Error: Unexpected APM Server response when polling config
     at processConfigErrorResponse (.../node_modules/elastic-apm-http-client/index.js:711:15)

which is obviously our problem. But the error is unhandled and taking down the process. Is that the expected behaviour? I can get over missing apm data, but having my service restart constantly is less than ideal. I don’t want my uptime constrained to that of the APM server.

To Reproduce

Ruin your ES

Expected behavior

Is there any way to handle this kind of error, and e.g. just log it? I found this, but I don’t want to trap all exceptions from the process (we already have a handler for that, to log & exit). Can I do something like:

apm.on("error", err => { log(err); })

Am I missing something obvious? Thanks.
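For context, the agent does document an apm.handleUncaughtExceptions() hook, but, as the reporter notes about process-wide traps, it catches every uncaught exception in the process, not just agent/transport errors. A minimal sketch of that pattern (the service name is hypothetical):

// Sketch: the agent's documented handleUncaughtExceptions hook. It traps
// every uncaught exception in the process, not only agent errors, which is
// exactly what the reporter wants to avoid.
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service' // hypothetical
})

apm.handleUncaughtExceptions((err) => {
  // Decide here whether to log and exit; exiting is usually still safest.
  console.error('uncaught exception', err)
  process.exit(1)
})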

Environment (please complete the following information)

  • OS: Debian 9
  • Node.js version: 12.16.3 (LTS)
  • APM Server version: 7.6.1
  • Agent version: 3.5

How are you starting the agent? (please tick one of the boxes)

  • Calling agent.start() directly (e.g. require('elastic-apm-node').start(...))
  • Requiring elastic-apm-node/start from within the source code
  • Starting node with -r elastic-apm-node/start

Additional context

Using env vars for config

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 15
  • Comments: 32 (11 by maintainers)

Top GitHub Comments

4 reactions
trentm commented, Feb 19, 2021

I fluked and found a local repro for this.

tl;dr: It only happens with node v12 (for me at least), and only on the second failed attempt to retrieve central config from the APM server.

Working from this part of the traceback above:

 Emitted 'error' event on TLSSocket instance at:
     at emitErrorNT (internal/streams/destroy.js:92:8)
     at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
     at processTicksAndRejections (internal/process/task_queues.js:84:21) {

I was able to match those line numbers to node v12’s version of internal/streams/destroy.js. Using this patch to node_modules/elastic-apm-http-client/index.js (importantly, the part that shortens the delay before the next poll) and running a hello-world app using APM with node v12:

% node --version
v12.19.1

with the app itself configured to talk to a mock APM server that responds with 503 to central config requests, the result was the crash.

I haven’t dug in any further yet.
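For anyone trying to reproduce this, here is a minimal sketch of the kind of mock APM server described above. The /config/v1/agents path is an assumption about the endpoint the client polls for central config; everything else is plain Node core:

// Mock APM server sketch: answer central-config polls with 503 and accept
// everything else with 202. '/config/v1/agents' is assumed to be the path
// the apm-server client polls for central config.
const http = require('http')

http.createServer((req, res) => {
  if (req.url.startsWith('/config/v1/agents')) {
    res.statusCode = 503
    res.end('{"error":"simulated central config failure"}')
  } else {
    res.statusCode = 202
    res.end()
  }
}).listen(8200, () => {
  console.log('mock APM server listening on http://localhost:8200')
})

Point the agent at it with serverUrl: 'http://localhost:8200' (or ELASTIC_APM_SERVER_URL) and leave centralConfig enabled.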

2 reactions
trentm commented, Feb 20, 2021

I finally got back to this… and couldn’t repro for a while because I hadn’t been specific about my “hello world app” before.

better details on a repro

Here are better details on a repro.

  1. Using the following demo app:
var apm = require('elastic-apm-node').start({
  serviceName: 'issue1749-two-config-crash-app',
  metricsInterval: 0,
  // captureExceptions:false so we can see the error on the console, rather
  // than just in the error object sent to APM server.
  captureExceptions: false,
  logLevel: 'debug',
  // disableInstrumentations: ['express']
})

if (process.env.PLZ_REQUIRE_EXPRESS) {
  require('express')
}
const http = require('http')
const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'text/plain')
  res.end('hi')
})
const port = 3000
server.listen(port, () => {
  console.log(`listening at http://localhost:${port}`)
})
  2. With the following patch to "node_modules/elastic-apm-http-client/index.js" (mainly to get the shorter _scheduleNextConfigPoll() value):
diff --git a/index.js b/index.js
index c27fb71..52ae377 100644
--- a/index.js
+++ b/index.js
@@ -163,12 +163,16 @@ Client.prototype._pollConfig = function () {
   }

   const req = this._transport.get(opts, res => {
+    console.warn('XXX res.statusCode', res.statusCode)
     res.on('error', err => {
       // Not sure this event can ever be emitted, but just in case
+      console.warn('XXX in res.on(error), calling res.destroy(err)', err)
       res.destroy(err)
     })

-    this._scheduleNextConfigPoll(getMaxAge(res))
+    // XXX
+    // this._scheduleNextConfigPoll(getMaxAge(res))
+    this._scheduleNextConfigPoll(3)

     if (
       res.statusCode === 304 || // No new config since last time
@@ -180,26 +184,36 @@ Client.prototype._pollConfig = function () {
     }

     streamToBuffer(res, (err, buf) => {
-      if (err) return res.destroy(err)
+      if (err) {
+        console.warn('XXX streamToBuffer error, calling res.destroy', err)
+        return res.destroy(err)
+      }

       if (res.statusCode === 200) {
         // 200: New config available (or no config for the given service.name / service.environment)
         const etag = res.headers.etag
         if (etag) this._conf.lastConfigEtag = etag

+        // XXX Should not emit in the try/catch. If listener throws, then we catch it here.
         try {
           this.emit('config', JSON.parse(buf))
         } catch (e) {
+          console.warn('XXX config parse error, calling res.destroy', e)
           res.destroy(e)
         }
       } else {
-        res.destroy(processConfigErrorResponse(res, buf))
+        // XXX
+        console.warn('XXX config res.statusCode !== 200, calling res.destroy', res.statusCode, )
+        var cfgErr = processConfigErrorResponse(res, buf)
+        res.destroy(cfgErr)
+        // res.destroy(processConfigErrorResponse(res, buf))
       }
     })
   })

   req.on('error', err => {
-    this._scheduleNextConfigPoll()
+    console.warn('XXX req.on error', err)
+    this._scheduleNextConfigPoll(3)
     this.emit('request-error', err)
   })
 }
@@ -213,6 +227,7 @@ Client.prototype._scheduleNextConfigPoll = function (seconds) {
     this._configTimer = null
     this._pollConfig()
   }, seconds * 1000)
+  console.warn('XXX _scheduleNextConfigPoll in %ss', seconds)

   this._configTimer.unref()
 }
  3. And letting it run for long enough (~15-20s) to make a second centralConfig poll.

Running node foo.js -> no crash
Running PLZ_REQUIRE_EXPRESS=1 node foo.js -> crash

I.e.: just require’ing “express” makes a difference here.

More specific version details for the record:

% npm ls express
play@1.0.1 /Users/trentm/tm/play
└── express@4.17.1

% npm ls elastic-apm-node
play@1.0.1 /Users/trentm/tm/play
└── elastic-apm-node@3.10.0  -> /Users/trentm/el/apm-agent-nodejs5


[11:20:35 trentm@pink:~/el/apm-agent-nodejs5 (git:tags/v3.10.0)]
% git log1 -1
* 68f27d3 - (HEAD, tag: v3.10.0) 3.10.0 (#1936) (6 weeks ago) <Trent Mick>

% git diff --staged
diff --git a/package.json b/package.json
index 6aef98b..da46f93 100644
--- a/package.json
+++ b/package.json
@@ -87,7 +87,7 @@
     "console-log-level": "^1.4.1",
     "cookie": "^0.4.0",
     "core-util-is": "^1.0.2",
-    "elastic-apm-http-client": "^9.4.2",
+    "elastic-apm-http-client": "file:../apm-nodejs-http-client3",
     "end-of-stream": "^1.4.4",
     "error-stack-parser": "^2.0.6",
     "escape-string-regexp": "^4.0.0",

[11:21:34 trentm@pink:~/el/apm-nodejs-http-client3 (git:tags/v9.4.2)]
% git log1 -1
* 5a501f5 - (HEAD, tag: v9.4.2) 9.4.2 (3 months ago) <Trent Mick>
  • fails with express@5 as well:
    % npm ls express
    play@1.0.1 /Users/trentm/tm/play
    └── express@5.0.0-alpha.8
    
  • does not crash if the agent is configured with: disableInstrumentations: ['express']
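In other words, a possible stop-gap until a fix lands is to disable the express instrumentation (a sketch; disableInstrumentations is a documented agent option, but it also means Express requests are no longer instrumented):

// Workaround sketch: skipping the express instrumentation means setFramework()
// (and thus a second client.config()) is never triggered for express, so the
// apm-server client does not switch to its keep-alive http.Agent (see the
// analysis below). The service name is hypothetical.
require('elastic-apm-node').start({
  serviceName: 'my-service',
  disableInstrumentations: ['express']
})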

the issue

Working backwards from the crash point.

  • The crash is from an unhandled “error” event on a socket instance:

    ...
    Emitted 'error' event on Socket instance at:
        at emitErrorNT (internal/streams/destroy.js:92:8)
    ...
    
  • That "error" event is being emitted due to the agent's apm-server client calling res.destroy(err) after a non-200 response from polling the APM server for config here. IncomingMessage.destroy([error]) emits "error" on the socket.

    So why only on the second poll to APM Server?

  • The apm-server client creates an http.Agent with keepAlive=true by default. The intent is to use that keep-alive agent for requests to the APM server. However, due to a bug in #53, the agent (set as this._agent) isn’t set until after the request options for apm-server requests are set.

    This means that this._agent === undefined here when request options are set. In other words, the keep-alive agent isn't getting used. (A simplified sketch of this ordering problem follows this list.)

  • However, if client.config(...) is called (again) on the client after initial creation, then those request options will be created and this time this._agent will be set.

    There is one current case in the Node.js APM Agent where client.config() is called again: in agent.setFramework(), which is called automatically in these module instrumentations:

    modules/hapi.js:  agent.setFramework({ name: 'hapi', version, overwrite: false })
    modules/restify.js:  agent.setFramework({ name: 'restify', version, overwrite: false })
    modules/koa.js:  agent.setFramework({ name: 'koa', version, overwrite: false })
    modules/express.js:  agent.setFramework({ name: 'express', version, overwrite: false })
    modules/fastify.js:  agent.setFramework({ name: 'fastify', version, overwrite: false })
    

    The result is that if any of those 5 modules is required in a user app, then the apm-server client will switch to using its keep-alive agent. The apm-server client always immediately polls for central config, so this switch to the keep-alive agent will only occur for 2nd and subsequent calls.

  • So why isn't there an "error" event listener on that res.socket? And why only in node v12? Because of this "TODO" in node core's http client handling of keep-alive after the "end" of a response: https://github.com/nodejs/node/blob/v12.19.1/lib/_http_client.js#L625-L626

    socket.removeListener('error', socketErrorListener);
    socket.removeListener('data', socketOnData);
    socket.removeListener('end', socketOnEnd);
    
    // TODO(ronag): Between here and emitFreeNT the socket
    // has no 'error' handler.
    

    In other words, if you res.destroy(err) after the res "end" event in node v12 with a keep-alive agent, you'll be emitting an "error" event on a socket that has no "error" listener. That results in a crash (or in your uncaughtException handler firing).
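To make the #53 ordering bug above concrete, here is a simplified sketch (hypothetical names, not the client's actual code) of request options being built before the keep-alive agent exists:

// Simplified sketch of the ordering bug: the config-request options capture
// this._agent before it has been assigned, so the first poll runs without
// the keep-alive agent.
const http = require('http')

function Client () {
  // Options are built first; this._agent is still undefined here.
  this._configOpts = { agent: this._agent, path: '/config/v1/agents' }
  // Only afterwards is the keep-alive agent created.
  this._agent = new http.Agent({ keepAlive: true })
}

// If config() is called again later (e.g. via agent.setFramework() in the
// express instrumentation), the rebuilt options now pick up the keep-alive
// agent, which is why only the 2nd and later polls reuse a socket.
Client.prototype.config = function () {
  this._configOpts = { agent: this._agent, path: '/config/v1/agents' }
}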

a smaller repro demo

var http = require('http')
var agent = new http.Agent({keepAlive: true})
var opts = {
  agent: agent,
  hostname: 'www.google.com',
  method: 'GET',
  path: '/'
}

function makeReq() {
  var req = http.get(opts, function (res) {
      console.warn('res.socket "error" event listeners before end:', res.socket.listeners('error'))
      res.on('data', function (chunk) { })
      res.on('end', function () {
        console.warn('res.on end')
        console.warn('res.socket "error" event listeners after end:', res.socket.listeners('error'))
        res.destroy(new Error('call the whole thing off'))
      })
      res.on('error', function (err) {
        console.warn('res.on error', err)
      })
  })
  req.on('error', function (err) {
    console.warn('req.on error', err)
  })
}

makeReq()

setTimeout(function () {}, 10000000) // don't exit right away

Run that “bar.js” with node v12:

% node bar.js
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
res.on end
res.socket "error" event listeners after end: []
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

Run it with NODE_DEBUG=http to see more of what is going on in the node core http client.

% NODE_DEBUG=http node bar.js
HTTP 64387: call onSocket 0 0
HTTP 64387: createConnection www.google.com:80: {
  agent: Agent {
    _events: [Object: null prototype] {
      free: [Function],
      newListener: [Function: maybeEnableKeylog]
    },
    _eventsCount: 2,
    _maxListeners: undefined,
    defaultPort: 80,
    protocol: 'http:',
    options: { keepAlive: true, path: null },
    requests: {},
    sockets: { 'www.google.com:80:': [] },
    freeSockets: {},
    keepAliveMsecs: 1000,
    keepAlive: true,
    maxSockets: Infinity,
    maxFreeSockets: 256,
    maxTotalSockets: Infinity,
    totalSocketCount: 0,
    [Symbol(kCapture)]: false
  },
  hostname: 'www.google.com',
  method: 'GET',
  path: null,
  port: 80,
  host: 'www.google.com',
  keepAlive: true,
  servername: 'www.google.com',
  _agentKey: 'www.google.com:80:'
}
HTTP 64387: sockets www.google.com:80: 1 1
HTTP 64387: outgoing message end.
(node:64387) Warning: Setting the NODE_DEBUG environment variable to 'http' can expose sensitive data (such as passwords, tokens and authentication headers) in the resulting log.
HTTP 64387: AGENT incoming response!
res.socket "error" event listeners before end: [ [Function: socketErrorListener] ]
HTTP 64387: AGENT socket keep-alive
res.on end
res.socket "error" event listeners after end: []
HTTP 64387: CLIENT socket onFree
HTTP 64387: agent.on(free) www.google.com:80:
events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: call the whole thing off
    at IncomingMessage.<anonymous> (/Users/trentm/tm/play/bar.js:17:21)
    at IncomingMessage.emit (events.js:326:22)
    at endReadableNT (_stream_readable.js:1223:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:92:8)
    at emitErrorAndCloseNT (internal/streams/destroy.js:60:3)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

I’ll start a fix on Monday.
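For completeness, a defensive variant of bar.js (a sketch, not the agent's eventual fix): give the keep-alive socket an "error" listener before destroying the already-ended response, so the emitted event has somewhere to go:

// bar.js variant (sketch only): attach a socket 'error' listener before
// res.destroy(err), since node v12 removes its own listener at 'end' on
// keep-alive sockets (per the _http_client.js TODO above).
var http = require('http')
var agent = new http.Agent({ keepAlive: true })

http.get({ agent: agent, hostname: 'www.google.com', path: '/' }, function (res) {
  res.on('data', function () {})
  res.on('end', function () {
    if (res.socket) {
      res.socket.once('error', function (err) {
        console.warn('socket error (handled):', err.message)
      })
    }
    res.destroy(new Error('call the whole thing off'))
  })
  res.on('error', function (err) {
    console.warn('res.on error', err)
  })
}).on('error', function (err) {
  console.warn('req.on error', err)
})

setTimeout(function () {}, 10000000) // keep the process alive, as in bar.js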
