Connection reset by peer when using AWS Lambda

See original GitHub issue

Steps to reproduce

  1. Deploy a simple .NET Core 3.x app that uses aurora-postgresql (version >= 11) to AWS Lambda. Example of the configuration:
services.AddDbContextPool<DriveLogsDbContext, DriveLogsDbPostgresContext>(options =>
    options.UseNpgsql(_config.GetConnectionString("PostgresDriveLogsDb"),
            opts => opts.SetPostgresVersion(12, 4))
        .UseSnakeCaseNamingConvention());

Connection string (pool size does not matter):

"server=some-aurora-postgres-rds-endpoint;userid=root;pwd=pwd;port=5432;database=dbname;Minimum Pool Size=5;"
  2. curl some endpoint which accesses the DB (read-only is fine)
  3. Wait for 10 minutes (time could vary)
  4. curl the same endpoint again

The issue

During the second request, the Lambda’s connection to the DB is interrupted in the middle of execution:

[Error] Microsoft.EntityFrameworkCore.Database.Command: Failed executing DbCommand (22ms)
[Parameters=[@__filter_Name_0='?'], CommandType='Text', CommandTimeout='30']
SELECT COUNT(*)::INT FROM some_table AS l WHERE some_table.name = @__filter_Name_0
[Error] Microsoft.EntityFrameworkCore.Database.Command: Failed executing DbCommand (22ms)
[Parameters=[@__filter_Name_0='?'], CommandType='Text', CommandTimeout='30']
SELECT COUNT(*)::INT FROM some_table AS l WHERE some_table.name = @__filter_Name_0

[Error] Microsoft.EntityFrameworkCore.Query: An exception occurred while iterating over the results of a query for context type 'DriveLogs.Data.DriveLogsDbPostgresContext'.
System.InvalidOperationException: An exception has been raised that is likely due to a transient failure.
 ---> Npgsql.NpgsqlException (0x80004005): Exception while reading from stream
 ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at Npgsql.NpgsqlReadBuffer.<>c__DisplayClass30_0.<<Ensure>g__EnsureLong|0>d.MoveNext()
   at Npgsql.NpgsqlReadBuffer.<>c__DisplayClass30_0.<<Ensure>g__EnsureLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Npgsql.NpgsqlConnector.<>c__DisplayClass160_0.<<DoReadMessage>g__ReadMessageLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Npgsql.NpgsqlConnector.<>c__DisplayClass160_0.<<DoReadMessage>g__ReadMessageLong|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReaderAsync(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbCommand.ExecuteReader()
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReader(RelationalCommandParameterObject parameterObject)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.InitializeReader(DbContext _, Boolean result)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
   --- End of inner exception stack trace ---
   at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
   at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.Enumerator.MoveNext() 

Postgres log:

2021-02-22 05:49:00 UTC:172.20.66.0(51071):root@drivelogs:[31559]:LOG:
could not receive data from client: Connection reset by peer

Further technical details

Npgsql version: 4.1.8.0
PostgreSQL version: 12.4
Operating system: AWS Lambda

Other details about my project setup:

  1. The same app works fine with MySQL, so this is not an issue with our AWS setup or Lambda configuration. It could, however, be an AWS Postgres issue.
  2. I have two completely different projects; the second one uses Postgres 11.x (I don’t remember the minor version) and experiences the same issue.
  3. The issue does not happen during a Lambda cold start, and there must be a considerable delay between requests to reproduce it.
  4. I have seen similar tickets here before, but all of them were closed without a solution. I now have the time, a test stand, and a strong wish to fix the issue. What I’m asking for is some guidance on how to debug it.
  5. If it would help, I can create a simple example project and push it to a public repo.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

7 reactions
cfbao commented, Sep 13, 2021

Our team encountered the exact same problem, and I dug a bit deeper. Hopefully someone will find the notes below useful.


AWS Lambda purges idle connections over time (AWS docs). Allegedly the threshold is 350 seconds (Couchbase forum).

If a Lambda function’s two consecutive invocations are separated by about 10 minutes:

  1. The function’s execution environment is likely kept around (aka still “warm”).
  2. Npgsql’s own connection pruning mechanism stops working because the Lambda function’s execution environment is in a frozen state between invocations.
  3. Lambda purges the idle TCP connection to the DB because it’s been idling for more than 350 seconds.
  4. On the second invocation, Npgsql tries to use a stale connection (because of 2) but the underlying physical connection is dead (because of 3), resulting in the “connection reset by peer” error.

Adding Keepalive=30; seems to work in practice, but in an unexpected way, and probably with a race condition:

  • Between invocations, the keepalive mechanism also stops working, because the execution environment is frozen, so the connection is not kept alive.
  • AWS Lambda still purges the idle connection.
  • On the second invocation, if the keepalive action is performed before our application code uses any connections (always the case in my tests), the “connection reset by peer” error still happens (same reason as above), but it’s encountered by the keepalive action instead. Npgsql then closes this stale connection. When our application code “opens” a connection, Npgsql will return a new one this time. (All this is confirmed by turning on Npgsql logging).
  • The race condition: there’s no guarantee that the keepalive action is performed before application code uses a connection.
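
The sequence described in the bullets above can be observed by enabling Npgsql's own logging next to the keepalive setting. The following is a minimal sketch, assuming Npgsql 4.x or 5.x, where `NpgsqlLogManager` is the logging entry point (later major versions moved to Microsoft.Extensions.Logging). The class and method names are hypothetical and only serve the illustration.

```csharp
using Npgsql;
using Npgsql.Logging;

// Hypothetical helper class, not part of the original issue.
public static class NpgsqlDiagnostics
{
    // Call once at startup, before any other Npgsql API is used.
    // Console output from a Lambda function ends up in CloudWatch Logs.
    public static void EnableLogging()
    {
        // Arguments: minimum level, print level, print connector id.
        NpgsqlLogManager.Provider =
            new ConsoleLoggingProvider(NpgsqlLogLevel.Debug, true, true);
    }

    // Adds the 30-second keepalive discussed above to an existing
    // connection string (equivalent to appending "Keepalive=30;").
    public static string WithKeepAlive(string baseConnectionString)
    {
        var builder = new NpgsqlConnectionStringBuilder(baseConnectionString)
        {
            KeepAlive = 30
        };
        return builder.ConnectionString;
    }
}
```

With debug logging on, the connector id printed on each log line should make it possible to tell whether a query ran on a freshly opened connection or on one that was opened during an earlier invocation.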

An alternative solution without the race condition: set ConnectionLifetime to a positive value below Lambda’s maximum connection idle time, e.g. 180 seconds:

  • Connections are still reused within a single Lambda function invocation or across invocations in quick succession, assuming your Lambda function completes within this time span.
  • Npgsql won’t ever return stale connections, despite its pruning mechanism being frozen by AWS Lambda.
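
As a rough sketch of the ConnectionLifetime approach above, the setting can be applied through `NpgsqlConnectionStringBuilder` rather than by editing the connection string by hand. The helper name and the 180-second default are assumptions taken from the example value in this comment, not a recommendation from the Npgsql maintainers.

```csharp
using Npgsql;

// Hypothetical helper class, not part of the original issue.
public static class LambdaConnectionStrings
{
    // Returns a copy of the connection string with a bounded connection
    // lifetime. The value is in seconds and should stay below the idle
    // threshold after which Lambda drops the TCP connection.
    public static string WithBoundedLifetime(
        string baseConnectionString, int lifetimeSeconds = 180)
    {
        var builder = new NpgsqlConnectionStringBuilder(baseConnectionString)
        {
            ConnectionLifetime = lifetimeSeconds
        };
        return builder.ConnectionString;
    }
}
```

In the setup from the issue, the result would be passed to `UseNpgsql` in place of the raw `_config.GetConnectionString("PostgresDriveLogsDb")` value.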
1 reaction
fernando-cicconeto commented, Sep 29, 2021

We upgraded to Npgsql 5 and set ConnectionLifetime=300 in our functions yesterday. We have not seen any connection errors since then.
