Idle connections are broken/closed, and not detected properly, in some environments (e.g. AWS Lambda)
Problem description
When writing an AWS lambda it is often considered a reasonable practice to store client objects (database clients etc.) in a global variable so that they can be re-used by subsequent lambda invocations.
If this approach is used with a grpc-js client/channel and the lambda is idle for an extended period of time, the connection may no longer be usable when the lambda is resumed (e.g. the connection may have been closed by the server due to an idle timeout). However, getConnectivityState still reports the channel as “ready” (2), and the client attempts to use it.
The first request issued via this channel after that fails immediately with gRPC status 14 (UNAVAILABLE). If we immediately retry the request via an interceptor, the retry hangs until the client-side deadline is exceeded, and then the request fails with DEADLINE_EXCEEDED (4).
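As a rough illustration of the failure sequence above, the sketch below retries a call that fails with status 14 (UNAVAILABLE), but builds a new client before each retry instead of reusing the same channel. It is not grpc-js code: `StatusError`, `withFreshClientRetry`, and the `makeClient` factory are hypothetical names. Whether recreating the client actually avoids the hang described here is an assumption, not something confirmed in this issue.

```typescript
// Hypothetical sketch: none of these names come from @grpc/grpc-js.
const UNAVAILABLE = 14; // gRPC status code reported in this issue

interface StatusError {
  code: number;
}

// Type guard: does the thrown value carry a numeric status code?
function hasCode(err: unknown): err is StatusError {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { code?: unknown }).code === "number"
  );
}

// Runs `call` against a client. On UNAVAILABLE, builds a new client and
// retries, up to `maxRetries` times. Any other error is rethrown as-is.
async function withFreshClientRetry<C, R>(
  makeClient: () => C,
  initial: C,
  call: (client: C) => Promise<R>,
  maxRetries = 3,
): Promise<{ result: R; client: C }> {
  let client = initial;
  for (let attempt = 0; ; attempt++) {
    try {
      return { result: await call(client), client };
    } catch (err) {
      if (!hasCode(err) || err.code !== UNAVAILABLE || attempt >= maxRetries) {
        throw err;
      }
      // Assumption: a newly built client opens a new connection.
      client = makeClient();
    }
  }
}
```

The function returns the client that finally succeeded so the caller can store it back in the global variable for later invocations.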
Keepalive settings do not help in an AWS Lambda environment because the runtime is suspended between lambda invocations, and thus the keepalive pings are never sent by the client.
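Because the runtime is suspended between invocations, one possible client-side mitigation is to compare wall-clock time against the time of last use and rebuild the client when the gap approaches the load balancer's idle timeout. The sketch below is an illustration under assumptions, not a fix proposed in this issue: `IdleAwareHolder`, `maxIdleMs`, `dispose`, and the injected `now` clock are hypothetical names, and the 300-second threshold is simply a value chosen to sit below the 350s NLB idle timeout mentioned under Reproduction steps.

```typescript
// Hypothetical sketch: a holder for a module-level (global) client that is
// rebuilt when it has been idle for longer than `maxIdleMs`.
class IdleAwareHolder<C> {
  private client: C | undefined = undefined;
  private lastUsedMs = 0;
  private readonly makeClient: () => C;
  private readonly maxIdleMs: number;
  private readonly dispose: (client: C) => void;
  private readonly now: () => number;

  constructor(
    makeClient: () => C,
    maxIdleMs: number,
    dispose: (client: C) => void = () => {},
    now: () => number = () => Date.now(),
  ) {
    this.makeClient = makeClient;
    this.maxIdleMs = maxIdleMs;
    this.dispose = dispose;
    this.now = now;
  }

  // Returns the cached client, or a fresh one if the cached one sat idle
  // for longer than the threshold.
  get(): C {
    const t = this.now();
    if (this.client !== undefined && t - this.lastUsedMs > this.maxIdleMs) {
      this.dispose(this.client); // e.g. close the old channel
      this.client = undefined;
    }
    if (this.client === undefined) {
      this.client = this.makeClient();
    }
    this.lastUsedMs = t;
    return this.client;
  }
}

// Assumed threshold: 300s, chosen to be below the 350s NLB idle timeout.
const MAX_IDLE_MS = 300_000;
```

In a lambda, the holder itself would live in the global variable, and each invocation would call `get()` instead of touching the client directly. The clock is injected only so the idle logic can be exercised without waiting.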
Reproduction steps
In my case the server side of the gRPC channel is going through an AWS Network Load Balancer. The NLB has an idle timeout of 350s: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
So the repro involves:
- Instantiating a client/channel that connects to a gRPC service through an AWS NLB
- Making a few (successful) requests
- Leaving the client idle for 6+ minutes
- Attempting to make another request and observing that the client/channel is no longer usable even though the channel reports ConnectivityState 0 or 2.
I haven’t yet been able to repro this using a standalone node.js program. The simplest reproducer I’ve come up with uses SST. Following their quickstart guide and creating a “minimal/typescript” project, I have been able to reproduce this reliably:
- Create a minimal typescript project
- Modify the lambda code to create a global variable that contains a grpc client/channel so that on the first load, the connection is established through the NLB
- Consume the client to try to make requests from within the lambda body
- Run npx sst start to launch the local development mode
- Use the SST console to invoke the lambda a few times and observe the successful gRPC requests
- Wait for 6+ minutes
- Use the SST console to invoke the lambda again and observe that you get UNAVAILABLE and/or DEADLINE_EXCEEDED.
I have also been able to reproduce this directly in the deployed AWS lambda, by:
- Deploying the lambda
- Setting up “provisioned capacity” for it so that the lambda runtime will be re-used across multiple invocations
- Invoking it successfully a few times
- Waiting 6+ minutes
- Invoking it again and watching the gRPC connection issue recur
I would be happy to provide client code to reproduce this but the server I’m hitting is proprietary. If there is a toy service that we could use to reproduce it I don’t mind trying to stand up the NLB in front of it to help make this easier to repro/debug.
Environment
- OS name, version and architecture: Mac OS 12.5 + SST, or AWS Lambda node.js environment
- Node version [e.g. 8.10.0]: AWS Lambda node.js 16.x runtime; on laptop: node 16.15.0
- Node installation method [e.g. nvm]: AWS Lambda, and on laptop nodenv.
- If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4] N/A
- Package name and version [e.g. gRPC@1.12.0] @grpc/grpc-js v1.5.10 and v1.7.3
Additional context
These logs probably aren’t very useful since they are generated by our app, but:
21:48:32.922
{"level":20,"time":1669844912922,"pid":61312,"hostname":"neutraljanet.local","name":"RetryInterceptor","msg":"Request path: /cache_client.Scs/Set; response status code: 14; number of retries (0) is less than max (3), retrying."}
21:49:22.914
TimeoutError: 4 DEADLINE_EXCEEDED: undefined
at cacheServiceErrorMapper (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28189:14)
at Object.callback (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28623:20)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9829:30)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9425:9)
at InterceptingListenerImpl.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7382:23)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28049:23)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9644:140)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9612:175)
Top GitHub Comments
The suggestion about running the hello world example was just a response to
It’s not a high priority. I think the log should have the relevant information.
Thanks. Comparing our interceptor to the one that you linked, it looks like we might have copy/pasted from an older version of it that was missing a few things. I’ll compare our code with that code, and look at the spec.
Thanks again for your help. I’ll close this for now and re-open or open a new ticket if this doesn’t resolve things for us.