Idle connections are broken/closed, and not detected properly, in some environments (e.g. AWS Lambda)

Problem description

When writing an AWS lambda it is often considered a reasonable practice to store client objects (database clients etc.) in a global variable so that they can be re-used by subsequent lambda invocations.
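The global-variable pattern described above can be sketched as follows. This is a minimal illustration, not code from the report: `CacheClient`, `createClient`, `getClient`, and the simplified `handler` are hypothetical names, and a real handler would build an actual gRPC client inside the factory.

```typescript
// Hypothetical stand-in for a gRPC client; a real handler would hold a
// client generated for its service instead.
interface CacheClient {
  id: number;
}

let clientsCreated = 0;

// Hypothetical factory: a real one would open the gRPC channel here.
function createClient(): CacheClient {
  clientsCreated += 1;
  return { id: clientsCreated };
}

// Module scope survives between invocations that land on the same warm
// runtime, so the client is created at most once per runtime.
let cachedClient: CacheClient | undefined;

function getClient(): CacheClient {
  if (cachedClient === undefined) {
    cachedClient = createClient();
  }
  return cachedClient;
}

// Simplified handler shape; a real Lambda handler receives an event and context.
function handler(): number {
  return getClient().id;
}
```

Every invocation that reuses the runtime also reuses the channel behind the client, which is what exposes the idle-connection problem described next.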

If this approach is used with a grpc-js client/channel, and the lambda is idle for an extended period of time, the connection may no longer be usable when the lambda is resumed (e.g. the server may have closed it due to an idle timeout). However, getConnectivityState still reports the channel as “ready” (2), and the client attempts to use it.

The first request issued via this channel after the idle period fails immediately with gRPC status 14 (UNAVAILABLE). If we immediately retry the request via an interceptor, the retry hangs until the client-side deadline is exceeded, and then the request fails with DEADLINE_EXCEEDED (4).
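To make the retry behaviour concrete, here is a minimal sketch of the decision a retry interceptor of this kind has to make. It is not the interceptor used in this report: `RetryDecisionInput` and `shouldRetry` are hypothetical names, and the deadline check is an assumption about how such a check could be written.

```typescript
// Numeric gRPC status code for UNAVAILABLE, as quoted in the report.
const UNAVAILABLE = 14;

interface RetryDecisionInput {
  statusCode: number;   // status of the attempt that just finished
  retriesSoFar: number; // retries already issued for this request
  maxRetries: number;   // upper bound on retries
  nowMs: number;        // current time, epoch milliseconds
  deadlineMs: number;   // client-side deadline, epoch milliseconds
}

// Retry only when the failure is UNAVAILABLE, the retry budget is not
// used up, and the client-side deadline has not passed yet.
function shouldRetry(input: RetryDecisionInput): boolean {
  if (input.statusCode !== UNAVAILABLE) {
    return false;
  }
  if (input.retriesSoFar >= input.maxRetries) {
    return false;
  }
  return input.nowMs < input.deadlineMs;
}
```

In the failure described above, the first attempt passes all three checks, so the retry is issued; the retry then hangs on the same channel until the deadline passes.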

Keepalive settings do not help in an AWS Lambda environment because the runtime is suspended between lambda invocations, and thus the keepalive pings are never sent by the client.
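For reference, keepalive in grpc-js is configured through channel options passed when the client is constructed. A minimal sketch: the option keys are ones documented for @grpc/grpc-js, while the values are illustrative assumptions, not the settings used in this report.

```typescript
// Channel options that would be passed as the options argument when
// constructing a grpc-js client. Values are illustrative assumptions.
const keepaliveOptions: Record<string, number> = {
  // Send a keepalive ping after this many milliseconds without activity.
  'grpc.keepalive_time_ms': 30000,
  // Treat the connection as dead if the ping is not acknowledged in time.
  'grpc.keepalive_timeout_ms': 5000,
  // Allow pings even when no calls are in flight (1 = enabled).
  'grpc.keepalive_permit_without_calls': 1,
};
```

These pings are sent by the client process itself, which is why they cannot be sent while the Lambda runtime is suspended.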

Reproduction steps

In my case the server side of the gRPC channel is going through an AWS Network Load Balancer. The NLB has an idle timeout of 350s: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
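Given a known idle timeout, one check a client could make before reusing a cached channel is to compare the time since its last use against that timeout. This is only a sketch of that idea and is not something proposed in the issue: `isLikelyStale` and `NLB_IDLE_TIMEOUT_MS` are hypothetical names, and it assumes timestamps come from a wall clock such as Date.now(), so that time spent suspended is counted.

```typescript
// Idle timeout cited above for the AWS Network Load Balancer, in milliseconds.
const NLB_IDLE_TIMEOUT_MS = 350 * 1000;

// Returns true when the connection has been unused for at least the idle
// timeout, meaning the load balancer may already have dropped it.
function isLikelyStale(
  lastUsedMs: number,
  nowMs: number,
  idleTimeoutMs: number = NLB_IDLE_TIMEOUT_MS,
): boolean {
  return nowMs - lastUsedMs >= idleTimeoutMs;
}
```

A caller could record Date.now() after each successful request and recreate the client when this check returns true; whether that is appropriate depends on the application.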

So the repro involves:

1. Instantiating a client/channel that connects to a gRPC service through an AWS NLB
2. Making a few (successful) requests
3. Leaving the client idle for 6+ minutes
4. Attempting to make another request and observing that the client/channel is no longer usable even though the channel reports ConnectivityState 0 (IDLE) or 2 (READY)
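When logging the channel state during a repro like this, it can help to print the state name rather than the raw number. A minimal sketch: the name list mirrors the order of the ConnectivityState enum in @grpc/grpc-js (0 = IDLE through 4 = SHUTDOWN), and `describeState` is a hypothetical helper, not part of the library.

```typescript
// Names in the same order as the ConnectivityState enum in @grpc/grpc-js.
const CONNECTIVITY_STATE_NAMES: readonly string[] = [
  'IDLE',
  'CONNECTING',
  'READY',
  'TRANSIENT_FAILURE',
  'SHUTDOWN',
];

// Maps the number returned by getConnectivityState to a readable name.
function describeState(state: number): string {
  const name = CONNECTIVITY_STATE_NAMES[state];
  return name === undefined ? `UNKNOWN(${state})` : name;
}
```

With this helper, the two values seen in the repro print as IDLE and READY.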

I haven’t yet been able to repro this using a standalone node.js program. The simplest reproducer I’ve come up with has been via SST. Following their quickstart guide and creating a “minimal/typescript” project, I have been able to reproduce this reliably:

1. Create a minimal typescript project
2. Modify the lambda code to create a global variable that contains a grpc client/channel so that on the first load, the connection is established through the NLB
3. Consume the client to try to make requests from within the lambda body
4. Run npx sst start to launch the local development mode
5. Use the SST console to invoke the lambda a few times and observe the successful gRPC requests
6. Wait for 6+ minutes
7. Use the SST console to invoke the lambda again and observe that you get UNAVAILABLE and/or DEADLINE_EXCEEDED

I have also been able to reproduce this directly in the deployed AWS lambda, by:

1. Deploying the lambda
2. Setting up “provisioned capacity” for it so that the lambda runtime will be re-used across multiple invocations
3. Invoking it successfully a few times
4. Waiting 6+ minutes
5. Invoking it again and watching the gRPC connection issue recur

I would be happy to provide client code to reproduce this but the server I’m hitting is proprietary. If there is a toy service that we could use to reproduce it I don’t mind trying to stand up the NLB in front of it to help make this easier to repro/debug.

Environment

OS name, version and architecture: Mac OS 12.5 + SST, or AWS Lambda node.js environment
Node version [e.g. 8.10.0]: AWS Lambda node.js 16.x runtime; on laptop: node 16.15.0
Node installation method [e.g. nvm]: AWS Lambda, and on laptop nodenv.
If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4] N/A
Package name and version [e.g. gRPC@1.12.0] @grpc/grpc-js v1.5.10 and v1.7.3

Additional context

These logs probably aren’t very useful since they are generated by our app, but:

21:48:32.922
{"level":20,"time":1669844912922,"pid":61312,"hostname":"neutraljanet.local","name":"RetryInterceptor","msg":"Request path: /cache_client.Scs/Set; response status code: 14; number of retries (0) is less than max (3), retrying."}
21:49:22.914
TimeoutError: 4 DEADLINE_EXCEEDED: undefined
at cacheServiceErrorMapper (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28189:14)
at Object.callback (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28623:20)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9829:30)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9425:9)
at InterceptingListenerImpl.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7382:23)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28049:23)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9644:140)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9612:175)

Issue Analytics

State:
Created 10 months ago
Reactions:1
Comments:13 (6 by maintainers)

Top GitHub Comments

murgatroid99 commented, Dec 1, 2022

The suggestion about running the hello world example was just a response to:

“I would be happy to provide client code to reproduce this but the server I’m hitting is proprietary. If there is a toy service that we could use to reproduce it I don’t mind trying to stand up the NLB in front of it to help make this easier to repro/debug.”

It’s not a high priority. I think the log should have the relevant information.

cprice404 commented, Dec 5, 2022

Thanks. Comparing our interceptor to the one that you linked, it looks like we might have copy/pasted from an older version of it that was missing a few things. I’ll compare our code with that code, and look at the spec.

Thanks again for your help. I’ll close this for now and re-open or open a new ticket if this doesn’t resolve things for us.
