Idle connections are broken/closed, and not detected properly, in some environments (e.g. AWS Lambda)

Problem description

When writing an AWS lambda it is often considered a reasonable practice to store client objects (database clients etc.) in a global variable so that they can be re-used by subsequent lambda invocations.

If this approach is used with a grpc-js client/channel and the lambda is idle for an extended period, the connection may no longer be usable when the lambda resumes (e.g. the server may have closed it due to an idle timeout). However, getConnectivityState still reports the channel as “ready” (2), and the client attempts to use it.

The first request issued on this channel after the idle period fails immediately with gRPC status 14 (UNAVAILABLE). If we immediately retry the request via an interceptor, the retry hangs until the client-side deadline is exceeded, and then the request fails with DEADLINE_EXCEEDED (4).

Keepalive settings do not help in an AWS Lambda environment because the runtime is suspended between lambda invocations, and thus the keepalive pings are never sent by the client.
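One possible workaround, which is not proposed in the issue itself, is to track when the cached client was last used and recreate it once it has been idle for longer than some threshold below the intermediary's idle timeout. The sketch below is a minimal illustration of that idea. IdleAwareClientCache, Closable, createClient and maxIdleMs are hypothetical names, not part of @grpc/grpc-js, and the sketch assumes that Date.now() reflects the wall-clock time that passed while the runtime was suspended.

```typescript
// Hypothetical helper, not part of @grpc/grpc-js.
// Anything with a close() method qualifies; a grpc-js Client has one.
interface Closable {
  close(): void;
}

class IdleAwareClientCache<T extends Closable> {
  private client: T | undefined = undefined;
  private lastUsedMs = 0;
  private readonly createClient: () => T;
  private readonly maxIdleMs: number;
  private readonly now: () => number;

  constructor(
    createClient: () => T,
    maxIdleMs: number,
    now: () => number = Date.now, // injectable clock, useful in tests
  ) {
    this.createClient = createClient;
    this.maxIdleMs = maxIdleMs;
    this.now = now;
  }

  // Returns the cached client, replacing it first if it sat idle too long.
  get(): T {
    const t = this.now();
    if (this.client !== undefined && t - this.lastUsedMs > this.maxIdleMs) {
      // The connection may have been dropped while the runtime was
      // suspended, so discard the client instead of trusting its state.
      this.client.close();
      this.client = undefined;
    }
    if (this.client === undefined) {
      this.client = this.createClient();
    }
    this.lastUsedMs = t;
    return this.client;
  }
}
```

In a lambda, the cache would live in the global variable in place of the client, and the handler would call get() at the start of each invocation. With the 350s NLB timeout described in this issue, a maxIdleMs such as 300000 would be one reasonable choice, though that value is an assumption and not something tested here.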

Reproduction steps

In my case the server side of the gRPC channel is going through an AWS Network Load Balancer. The NLB has an idle timeout of 350s: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

So the repro involves:

  • Instantiating a client/channel that connects to a gRPC service through an AWS NLB
  • Making a few (successful) requests
  • Leaving the client idle for 6+ minutes
  • Attempting to make another request and observing that the client/channel is no longer usable even though the channel reports ConnectivityState 0 (IDLE) or 2 (READY).
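For readers decoding the numeric values quoted in this issue, here is a small lookup sketch. The numbers follow the gRPC connectivity-state and status-code numbering used by grpc-js; the tables are redeclared locally so the snippet stands alone, and connectivityStateName and statusName are hypothetical helper names, not library functions. Only the status codes that appear in this issue (plus OK) are listed.

```typescript
// Connectivity states as reported by getConnectivityState.
const CONNECTIVITY_STATE_NAMES: Record<number, string> = {
  0: "IDLE",
  1: "CONNECTING",
  2: "READY",
  3: "TRANSIENT_FAILURE",
  4: "SHUTDOWN",
};

// Subset of gRPC status codes: only those mentioned in this issue, plus OK.
const STATUS_NAMES: Record<number, string> = {
  0: "OK",
  4: "DEADLINE_EXCEEDED",
  14: "UNAVAILABLE",
};

// Hypothetical helpers that turn a numeric code into a readable name.
function connectivityStateName(code: number): string {
  return CONNECTIVITY_STATE_NAMES[code] ?? `UNKNOWN(${code})`;
}

function statusName(code: number): string {
  return STATUS_NAMES[code] ?? `UNKNOWN(${code})`;
}

console.log(connectivityStateName(2)); // READY
console.log(statusName(14)); // UNAVAILABLE
```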

I haven’t yet been able to repro this using a standalone Node.js program. The simplest reproducer I’ve come up with uses SST. Following their quickstart guide and creating a “minimal/typescript” project, I have been able to reproduce this reliably:

  • Create a minimal typescript project
  • Modify the lambda code to create a global variable that contains a grpc client/channel so that on the first load, the connection is established through the NLB
  • Consume the client to try to make requests from within the lambda body
  • Run npx sst start to launch the local development mode
  • Use the SST console to invoke the lambda a few times and observe the successful gRPC requests
  • Wait for 6+ minutes
  • Use the SST console to invoke the lambda again and observe that you get UNAVAILABLE and/or DEADLINE_EXCEEDED.

I have also been able to reproduce this directly in the deployed AWS lambda, by:

  • Deploying the lambda
  • Setting up “provisioned capacity” for it so that the lambda runtime will be re-used across multiple invocations
  • Invoking it successfully a few times
  • Waiting 6+ minutes
  • Invoking it again and watching the gRPC connection issue recur

I would be happy to provide client code to reproduce this but the server I’m hitting is proprietary. If there is a toy service that we could use to reproduce it I don’t mind trying to stand up the NLB in front of it to help make this easier to repro/debug.

Environment

  • OS name, version and architecture: Mac OS 12.5 + SST, or AWS Lambda node.js environment
  • Node version [e.g. 8.10.0]: AWS Lambda node.js 16.x runtime; on laptop: node 16.15.0
  • Node installation method [e.g. nvm]: AWS Lambda, and on laptop nodenv.
  • If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4]: N/A
  • Package name and version [e.g. gRPC@1.12.0] @grpc/grpc-js v1.5.10 and v1.7.3

Additional context

These logs probably aren’t very useful since they are generated by our app, but:

21:48:32.922
{"level":20,"time":1669844912922,"pid":61312,"hostname":"neutraljanet.local","name":"RetryInterceptor","msg":"Request path: /cache_client.Scs/Set; response status code: 14; number of retries (0) is less than max (3), retrying."}
21:49:22.914
TimeoutError: 4 DEADLINE_EXCEEDED: undefined
at cacheServiceErrorMapper (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28189:14)
at Object.callback (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28623:20)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9829:30)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9425:9)
at InterceptingListenerImpl.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7382:23)
at file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:7386:31
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:28049:23)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9644:140)
at Object.onReceiveStatus (file:///Users/cprice/git/momento/client-sdk-javascript/examples/sst/my-sst-app/.sst/artifacts/dev-my-sst-app-MyStack-api-Lambda_GET_-/services/functions/lambda.js:9612:175)

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Reactions: 1
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
murgatroid99 commented, Dec 1, 2022

The suggestion about running the hello world example was just a response to:

“I would be happy to provide client code to reproduce this but the server I’m hitting is proprietary. If there is a toy service that we could use to reproduce it I don’t mind trying to stand up the NLB in front of it to help make this easier to repro/debug.”

It’s not a high priority. I think the log should have the relevant information.

0 reactions
cprice404 commented, Dec 5, 2022

Thanks. Comparing our interceptor to the one that you linked, it looks like we might have copy/pasted from an older version of it that was missing a few things. I’ll compare our code with that code, and look at the spec.

Thanks again for your help. I’ll close this for now and re-open or open a new ticket if this doesn’t resolve things for us.
