Random timeout/deadline exceeded unhandled rejections
Environment details
- OS: Cloud Functions environment (Ubuntu 18.04)
- Node.js version: 16
- npm version: Not sure, the one used by Cloud Functions
- @google-cloud/spanner version: 6.1.2, but the issue has been happening since version 5.x
Steps to reproduce
Run a Cloud Function using the Spanner client with an opened database, even without making requests.
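For reference, the affected functions look roughly like this (a minimal sketch; project, instance, and database names are placeholders):

```js
// index.js -- deployed with min_instances >= 1, so instances sit idle for long periods.
const {Spanner} = require('@google-cloud/spanner');

// Client and database handle are created at module scope, as usual for Cloud Functions.
const spanner = new Spanner({projectId: 'my-project'});
const database = spanner.instance('my-instance').database('my-database');

exports.handler = async (req, res) => {
  // Business code may or may not touch `database`; the unhandled rejections
  // show up while the instance is idle, not while handling requests.
  res.status(200).send('ok');
};
```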
Context
I’ve been using the Node.js Spanner client in several projects for about a year now, and I’ve consistently noticed DEADLINE_EXCEEDED and other related errors that are not directly linked to my requests, usually when the client hasn’t been used for several minutes.
By that I mean that these errors surface as unhandled promise rejections in Node.js. Cloud Functions makes this especially visible, since an unhandled rejection makes the Functions Framework crash and log the error.
This issue occurs in many of my Cloud Functions performing various unrelated tasks with Spanner. Again, business code is never impacted: the unhandled rejections usually occur when the Cloud Function has been inactive for several minutes (or even several hours). This also means the bug is much more visible when the minimum number of instances is set to >= 1, as the Cloud Function instances then stay idle longer (or even indefinitely).
I see no reason why this wouldn’t happen in other environments, but I’ve mainly been using Cloud Functions lately and can only speak to the behaviour in that context.
All Cloud Functions show fairly low activity (most of the time they are stuck at min_instances: 1 and sometimes reach 5 instances). The Cloud Spanner instance is running with 200 processing units and is underutilised (it never goes above 10% CPU).
The problem here is really about the rejections not being caught by the client. I’m not really worried about the errors themselves as they never seem to occur when running business operations/transactions (or at least they are being caught and properly retried in this case).
My only lead would be the keepAlive mechanism, as it uses setInterval and periodically makes requests to Spanner, especially when sessions aren’t being used. However, after looking at the code I haven’t seen anything obvious.
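To illustrate the pattern I have in mind (an illustrative sketch only, not the actual client code): if the periodic ping’s promise is dropped, any DEADLINE_EXCEEDED or CANCELLED coming back from it has nowhere to go and surfaces as an unhandled rejection.

```js
// Illustrative only -- not the Spanner client's actual implementation.
// If the promise returned by the periodic ping is dropped, any transport
// error (DEADLINE_EXCEEDED, CANCELLED, ...) becomes an unhandled rejection.
function startKeepAlive(pingIdleSessions, intervalMs) {
  return setInterval(() => {
    pingIdleSessions(); // returned promise is never awaited or .catch()-ed
  }, intervalMs);
}
```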
Stack traces
Unfortunately, as these are unhandled rejections, the stack traces are usually not very helpful, but one of them clearly references the Spanner client. To give a rough idea of the rate of occurrence: it has happened 33 times in 10 days across around 100 Cloud Functions, most of which have min_instances: 1 set. Out of the 33 errors, 4 are CANCELLED, 3 are Total timeout of API..., and the rest are DEADLINE_EXCEEDED.
Unhandled rejection
Error: Total timeout of API google.spanner.v1.Spanner exceeded 60000 milliseconds before any response was received.
at repeat (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
at Timeout._onTimeout (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
at listOnTimeout (node:internal/timers:559:17)
at processTimers (node:internal/timers:502:7)
Error: Process exited with code 16
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:92:22)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at process.exit (node:internal/process/per_thread:189:15)
at sendCrashResponse (/workspace/node_modules/@google-cloud/functions-framework/build/src/logger.js:44:9)
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:88:44)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at emit (node:internal/process/promises:140:20)
at processPromiseRejections (node:internal/process/promises:274:27)
Unhandled rejection
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
at runNextTicks (node:internal/process/task_queues:65:3)
// Same Error: Process exited with code 16 coming from the functions-framework.
Unhandled rejection
Error: 1 CANCELLED: Call cancelled
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
// Same Error: Process exited with code 16 coming from the functions-framework.
Top GitHub Comments
This makes sense. Give me some time to figure out which part of the code the error should get caught in and I will add code to catch this. I am going to be on vacation for a week but I will look into this when I get back.
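For reference, the kind of change being discussed is simply making sure the background ping’s rejection is consumed somewhere, along these lines (names are illustrative, not the actual client internals):

```js
// Illustrative sketch of the fix discussed above: consume errors from the
// background keep-alive ping instead of letting them escape as unhandled
// rejections (a real fix may also want to log or retry).
function startKeepAlive(pingIdleSessions, intervalMs) {
  return setInterval(() => {
    pingIdleSessions().catch((err) => {
      console.debug('keep-alive ping failed:', err.message);
    });
  }, intervalMs);
}
```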
@dannnnthemannnn To be honest, I wouldn’t call these parameters production ready. It just amounts to setting a parameter that wasn’t made for this to an extremely high value, and setting it too high might even have side effects.
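For context, the workaround being referred to is (as far as I understand it) stretching the session pool timing options so the keep-alive ping effectively never fires while the function is idle; the option names below are my reading of SessionPoolOptions and should be treated as an assumption, not a recommendation:

```js
// Workaround sketch only (not production ready, as noted above). The idea is
// to push the session pool's idle/keep-alive timings so high that the
// background ping effectively never runs. Option names/semantics are an
// assumption and may differ between client versions.
const database = spanner.instance('my-instance').database('my-database', {
  keepAlive: 24 * 60,  // minutes between keep-alive pings of idle sessions
  idlesAfter: 24 * 60, // minutes before a session is considered idle
});
```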
@asthamohta Unfortunately the error has occurred again after a few days. It’s not as frequent as before, but it still occurs. Also, because the errors are still unhandled, they still make noise in Error Reporting / Cloud Logging / our alerting system.
The diagnosis (about Cloud Functions killing background activity) makes sense, and I don’t really mind that the keep-alive mechanism cannot work in this case. But I’d suggest fixing the error handling as we discussed, because that’s unfortunately something we cannot work around outside the Spanner client itself. Also, although I’m aware this is more work, it might be useful to have a proper parameter for disabling the keep-alive mechanism when it’s expected not to work. Currently, working around it with the existing parameters doesn’t seem to work 100% of the time, and that makes it hard to know whether there is actually something more to look into or whether the existing parameters simply aren’t made for this.