Suspected race condition in NpgsqlConnector.EndUserAction
See original GitHub issueSteps to reproduce
Unfortunately, so far I have not been able to identify steps to reproduce this problem. We only know that most often the problem occurs when there are certain problems on the database server
The issue
From time to time I see an error with the following content:
Exception type: System.Threading.SemaphoreFullException
Exception message: Adding the specified count to the semaphore would cause it to exceed its maximum count.
Stack trace:
at System.Threading.SemaphoreSlim.Release(Int32 releaseCount)
at Npgsql.NpgsqlConnector.EndUserAction()
at Npgsql.NpgsqlDataReader.Cleanup(Boolean async, Boolean connectionClosing)
at Npgsql.NpgsqlDataReader.Close(Boolean connectionClosing, Boolean async)
at Npgsql.NpgsqlConnector.CloseOngoingOperations(Boolean async)
at Npgsql.NpgsqlConnection.<Close>g__CloseOngoingOperationAndFinish|76_0(NpgsqlConnection connection, NpgsqlConnector connector, Boolean async)
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.CloseAsync()
at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.CloseAsync()
at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.CleanupCommandAsync(DbCommand command, IRelationalConnection connection)
at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(DbContext _, Boolean result, CancellationToken cancellationToken)
at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
at Microsoft.EntityFrameworkCore.Query.Internal.QueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()
After that, using dotnet-counters, I see an increased number of busy connections. If you look at pg_stat_activity, we see that there are actually no “busy” connections, there are Idle connections that are no longer used for Keep Alive.
Further technical details
Npgsql version: 4.1.6 Microsoft.EntityFrameworkCore version: 3.1.10 PostgreSQL version: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit Operating system: Linux centos-release-7-5.1804.el7.centos.x86_64
Other details about my project setup:
Keepalive enabled! ConnectionString: User ID=user;Password=password;Host=host;Port=5432;Database=dbname;Keepalive=1;Pooling=true;Minimum Pool Size=69;Maximum Pool Size=70;Command Timeout=300;Application Name=AppWithProblems;Max Auto Prepare=150;
I looked at the code, but the only place I have doubts about is NpgsqlConnector.EndUserAction
if (_isKeepAliveEnabled)
{
lock (this)
{
if (IsReady || !IsConnected)
return;
var keepAlive = Settings.KeepAlive * 1000;
_keepAliveTimer!.Change(keepAlive, keepAlive);
Log.Trace("End user action", Id);
_currentCommand = null;
_userLock!.Release();
State = ConnectorState.Ready;
}
}
Specifically the following code:
if (IsReady || !IsConnected)
return;
Here, two comparisons are made with a variable that can potentially change at the moment when the first comparison has already been made, and the second has not yet. (Both properties (IsReady and IsConnected) compare the State property to ConnectorState enum elements. State property declared as:
internal ConnectorState State
{
get => (ConnectorState)_state;
set
{
var newState = (int)value;
if (newState == _state)
return;
Interlocked.Exchange(ref _state, newState);
}
}
What do you think about that? Could there be a race condition here, or am I wrong?
Would a fix like this help here ?:
//NpgsqlConnector.EndUserAction:
//...
if (_isKeepAliveEnabled)
{
lock (this)
{
var state = State;
if (state == ConnectorState.Ready ||
state == ConnectorState.Closed ||
state == ConnectorState.Connecting ||
state == ConnectorState.Broken) return;
//...
Issue Analytics
- State:
- Created 3 years ago
- Comments:18 (18 by maintainers)
@neyromant just something that came to my mind.
From your stack trace, which is peculiar, as it should happen only if there was an exception while executing a reader (refs). Which leads me to believe that we get the exception which npgsql doesn’t handle that well, and it leaves the connection is a bad state. And after that, we get another exception on the cleanup, which swallows the original one. I think, you should enable the efcore logging, so we can get a clean picture of the original exception.
I did attempt to replicate the issue, but so far no problems(