Random timeout/deadline exceeded unhandled rejections
Environment details
- OS: Cloud Functions environment (Ubuntu 18.04)
- Node.js version: 16
- npm version: Not sure, the one used by Cloud Functions
- @google-cloud/spanner version: 6.1.2, but the issue has been happening since version 5.x
Steps to reproduce
Run a Cloud Function using the Spanner client with an opened database, even without making requests.
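For reference, the affected functions look roughly like this (a minimal sketch; project, instance, and database names are placeholders):

```js
// index.js -- deployed with min_instances >= 1, so instances sit idle for long periods.
const {Spanner} = require('@google-cloud/spanner');

// Client and database handle are created at module scope, as usual for Cloud Functions.
const spanner = new Spanner({projectId: 'my-project'});
const database = spanner.instance('my-instance').database('my-database');

exports.handler = async (req, res) => {
  // Business code may or may not touch `database`; the unhandled rejections
  // show up while the instance is idle, not while handling requests.
  res.status(200).send('ok');
};
```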
Context
I’ve been using the Node.js Spanner client in several projects for about a year now, and I’ve consistently noticed DEADLINE_EXCEEDED and other related errors that are not directly linked to my requests, usually when the client hasn’t been used for several minutes.
By that I mean that these errors surface as unhandled promise rejections in Node.js. Cloud Functions makes this especially visible, since an unhandled rejection makes the Functions Framework crash and log the error.
This issue occurs in many of my Cloud Functions performing various unrelated tasks with Spanner. Again, business code is never impacted: the unhandled rejections usually occur when the Cloud Function has been inactive for several minutes (or even several hours). This also means the bug is much more visible when the minimum number of instances is set to >= 1, as the Cloud Function instances then stay idle longer (or even indefinitely).
I see no reason why this wouldn’t happen in other environments, but I’ve mainly been using Cloud Functions lately and can only speak to the behaviour in that context.
All Cloud Functions show fairly low activity (most of the time they are stuck at min_instances: 1 and sometimes reach 5 instances). The Cloud Spanner instance is running with 200 processing units and is underutilised (it never goes above 10% CPU).
The problem here is really about the rejections not being caught by the client. I’m not really worried about the errors themselves as they never seem to occur when running business operations/transactions (or at least they are being caught and properly retried in this case).
My only lead would be the keepAlive mechanism, as it uses setInterval and periodically makes requests to Spanner, especially when sessions aren’t being used. However, after looking at the code I haven’t seen anything obvious.
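To illustrate the pattern I have in mind (an illustrative sketch only, not the actual client code): if the periodic ping’s promise is dropped, any DEADLINE_EXCEEDED or CANCELLED coming back from it has nowhere to go and surfaces as an unhandled rejection.

```js
// Illustrative only -- not the Spanner client's actual implementation.
// If the promise returned by the periodic ping is dropped, any transport
// error (DEADLINE_EXCEEDED, CANCELLED, ...) becomes an unhandled rejection.
function startKeepAlive(pingIdleSessions, intervalMs) {
  return setInterval(() => {
    pingIdleSessions(); // returned promise is never awaited or .catch()-ed
  }, intervalMs);
}
```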
Stack traces
Unfortunately, as these are unhandled rejections, the stack traces are usually not very helpful, but one of them clearly references the Spanner client. To give a rough idea of the rate of occurrence: it has happened 33 times in 10 days across around 100 Cloud Functions, most of which have min_instances: 1 set. Out of the 33 errors, 4 are CANCELLED, 3 are Total timeout of API..., and the rest are DEADLINE_EXCEEDED.
Unhandled rejection
Error: Total timeout of API google.spanner.v1.Spanner exceeded 60000 milliseconds before any response was received.
at repeat (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
at Timeout._onTimeout (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
at listOnTimeout (node:internal/timers:559:17)
at processTimers (node:internal/timers:502:7)
Error: Process exited with code 16
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:92:22)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at process.exit (node:internal/process/per_thread:189:15)
at sendCrashResponse (/workspace/node_modules/@google-cloud/functions-framework/build/src/logger.js:44:9)
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:88:44)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at emit (node:internal/process/promises:140:20)
at processPromiseRejections (node:internal/process/promises:274:27)
Unhandled rejection
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
at runNextTicks (node:internal/process/task_queues:65:3)
// Same Error: Process exited with code 16 coming from the functions-framework.
Unhandled rejection
Error: 1 CANCELLED: Call cancelled
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
// Same Error: Process exited with code 16 coming from the functions-framework.
Top GitHub Comments
This makes sense. Give me some time to figure out which part of the code the error should get caught in and I will add code to catch this. I am going to be on vacation for a week but I will look into this when I get back.
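For reference, the kind of change being discussed is simply making sure the background ping’s rejection is consumed somewhere, along these lines (names are illustrative, not the actual client internals):

```js
// Illustrative sketch of the fix discussed above: consume errors from the
// background keep-alive ping instead of letting them escape as unhandled
// rejections (a real fix may also want to log or retry).
function startKeepAlive(pingIdleSessions, intervalMs) {
  return setInterval(() => {
    pingIdleSessions().catch((err) => {
      console.debug('keep-alive ping failed:', err.message);
    });
  }, intervalMs);
}
```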
@dannnnthemannnn To be honest, I wouldn’t call these parameters production ready. It just amounts to setting a parameter that wasn’t made for this to an extremely high value, and setting it too high might even have side effects.
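For context, the workaround being referred to is (as far as I understand it) stretching the session pool timing options so the keep-alive ping effectively never fires while the function is idle; the option names below are my reading of SessionPoolOptions and should be treated as an assumption, not a recommendation:

```js
// Workaround sketch only (not production ready, as noted above). The idea is
// to push the session pool's idle/keep-alive timings so high that the
// background ping effectively never runs. Option names/semantics are an
// assumption and may differ between client versions.
const database = spanner.instance('my-instance').database('my-database', {
  keepAlive: 24 * 60,  // minutes between keep-alive pings of idle sessions
  idlesAfter: 24 * 60, // minutes before a session is considered idle
});
```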
@asthamohta Unfortunately the error has occurred again after a few days. It’s not as frequent as before, but it still occurs. Also, because the errors are still unhandled, they still make noise in Error Reporting / Cloud Logging / our alerting system.
The diagnosis (about Cloud Functions killing background activity) makes sense, and I don’t really mind that the keep-alive mechanism cannot work in this case. But I’d suggest fixing the error handling as we discussed, because that’s unfortunately something we cannot work around outside the Spanner client itself. Also, although I’m aware this is more work, it might be useful to have a proper parameter for disabling the keep-alive mechanism when it’s expected not to work. Currently, working around it with the existing parameters doesn’t seem to work 100% of the time, and that makes it hard to know whether there is actually something more to look into or whether the existing parameters simply aren’t made for this.