Very frequent applying of configuration + occasional timeout
Turning on Debug-level logging shows that the configuration is applied constantly, even though there are no changes.
Here is how it looks:
As seen above, the fingerprint is always the same (as there are no changes) and applyConfig is true. That should always hit the first case below, and I wonder whether that is intended, since the configuration is already correct.
var fingerprint = (ConfigurationStorage.ProposedConfiguration ?? ConfigurationStorage.ActiveConfiguration).Fingerprint;
Logger.IncomingConfiguration(fingerprint, config.Fingerprint, applyConfig);
switch ((config.Fingerprint == fingerprint, applyConfig))
{
    case (true, true):
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
        break;
    case (true, false):
        break;
    case (false, false):
        await ConfigurationStorage.ProposeAsync(config).ConfigureAwait(false);
        break;
    case (false, true):
        result = result with { Value = false };
        break;
}
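To make my question concrete: this is the kind of early exit I expected in the (true, true) arm. It is only a sketch of my assumption about the intent, not the library's actual behaviour (ApplyAsync may already handle the no-op case internally):

```csharp
// Sketch only (my assumption, not the actual DotNext code): skip ApplyAsync when
// the matching fingerprint belongs to the already-active configuration and nothing
// has been proposed, so a steady stream of identical heartbeats becomes a no-op
// instead of re-entering the storage lock on every AppendEntries.
case (true, true):
    if (ConfigurationStorage.ProposedConfiguration is not null)
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
    break;
```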
The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). Warning-level logging gives a single entry when the timeout occurs:
2023-06-20 00:21:57.7202|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.11:40896|System.Threading.Tasks.TaskCanceledException: A task was canceled.
at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 144
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 629
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 121
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 135
Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts applying the configuration during AppendEntries: the configuration is always the same, so there should be nothing to apply, right?
The other thing I find odd is that I thought InMemoryClusterConfigurationStorage only holds the configuration in memory, so how/why would ApplyAsync take long enough to trigger a timeout? Is it possible there is some locking issue in this area of the implementation, perhaps related to events raised while the lock is held?
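To make the locking question concrete, here is a purely hypothetical sketch of the interaction I am wondering about. The type and the `ConfigurationApplied` callback are made-up names, not the DotNext API: if the storage awaits subscriber callbacks while holding its internal lock, a slow subscriber would stall every later ApplyAsync call until the request timeout fires.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Purely illustrative storage, not the DotNext implementation: it shows how an
// exclusive lock held across awaited callbacks would make an otherwise cheap
// in-memory ApplyAsync slow enough to hit the request timeout.
sealed class IllustrativeConfigStorage
{
    private readonly SemaphoreSlim accessLock = new(1, 1);

    // Made-up callback name; the real storage may expose something different.
    public Func<CancellationToken, ValueTask>? ConfigurationApplied { get; set; }

    public async ValueTask ApplyAsync(CancellationToken token)
    {
        await accessLock.WaitAsync(token).ConfigureAwait(false);
        try
        {
            // The in-memory bookkeeping itself is cheap, but if the subscriber
            // below is slow (disk, network, blocked thread), the lock stays held
            // and every concurrent ApplyAsync call queues behind it until the
            // caller's CancellationToken fires.
            if (ConfigurationApplied is { } handler)
                await handler(token).ConfigureAwait(false);
        }
        finally
        {
            accessLock.Release();
        }
    }
}
```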
> You can also measure broadcast-time over a long period of time to ensure that the leader has enough time to send heartbeats.

Thanks for looking into it. I thought the stack trace was certainly weird, but it did not occur to me why that could be.
By the way, I was trying to collect the data, but early attempts did not reproduce the issue and then other work got in the way.
When I get back to this I will try to find better information on what's going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then resumed normal operation after that).
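Following the broadcast-time suggestion, this is roughly the long-running sampling I have in mind for when I get back to it. It is only a network-level proxy (ICMP round trips, not real AppendEntries latency), and the member addresses beyond the first one are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.Net.NetworkInformation;
using System.Threading.Tasks;

// Rough proxy only: samples ICMP round trips to the other cluster members over
// a long window. It does not measure the real Raft broadcast time (which also
// includes AppendEntries processing), but it should reveal windows where the
// leader could not have reached a follower within the election timeout.
// Only 192.168.100.11 comes from the log above; the other addresses are placeholders.
string[] members = { "192.168.100.11", "192.168.100.12", "192.168.100.13" };
var interval = TimeSpan.FromSeconds(1);
var duration = TimeSpan.FromHours(24);

using var ping = new Ping();
var clock = Stopwatch.StartNew();

while (clock.Elapsed < duration)
{
    foreach (var member in members)
    {
        var reply = await ping.SendPingAsync(member, 1000);
        Console.WriteLine($"{DateTime.UtcNow:O} {member} {reply.Status} {reply.RoundtripTime} ms");
    }

    await Task.Delay(interval);
}
```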