Very frequent applying of configuration + occasional timeout
Turning on Debug-level logging shows that the configuration is applied constantly, even though there are no changes.
Here is how it looks:
As seen above, the fingerprint is always the same (as there are no changes) and applyConfig is true. That should always hit the first case below, and I wonder whether that is intended, since the configuration is already correct.
var fingerprint = (ConfigurationStorage.ProposedConfiguration ?? ConfigurationStorage.ActiveConfiguration).Fingerprint;
Logger.IncomingConfiguration(fingerprint, config.Fingerprint, applyConfig);
switch ((config.Fingerprint == fingerprint, applyConfig))
{
    case (true, true):
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
        break;
    case (true, false):
        break;
    case (false, false):
        await ConfigurationStorage.ProposeAsync(config).ConfigureAwait(false);
        break;
    case (false, true):
        result = result with { Value = false };
        break;
}
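To make my question concrete: this is the kind of early exit I expected in the (true, true) arm. It is only a sketch of my assumption about the intent, not the library's actual behaviour (ApplyAsync may already handle the no-op case internally):

```csharp
// Sketch only (my assumption, not the actual DotNext code): skip ApplyAsync when
// the matching fingerprint belongs to the already-active configuration and nothing
// has been proposed, so a steady stream of identical heartbeats becomes a no-op
// instead of re-entering the storage lock on every AppendEntries.
case (true, true):
    if (ConfigurationStorage.ProposedConfiguration is not null)
        await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
    break;
```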
The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). Warning-level logging gives a single entry when the timeout occurs:
2023-06-20 00:21:57.7202|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.11:40896|System.Threading.Tasks.TaskCanceledException: A task was canceled.
at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 144
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 629
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 121
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 135
Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts applying the configuration during AppendEntries: the configuration is always the same, so there should be nothing to apply, right?
The other thing I find odd is that I thought InMemoryClusterConfigurationStorage only holds the configuration in memory, so how/why would ApplyAsync take long enough to trigger a timeout? Is it possible there is some locking issue in this area of the implementation, perhaps related to events raised while the lock is held?
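To make the locking question concrete, here is a purely hypothetical sketch of the interaction I am wondering about. The type and the `ConfigurationApplied` callback are made-up names, not the DotNext API: if the storage awaits subscriber callbacks while holding its internal lock, a slow subscriber would stall every later ApplyAsync call until the request timeout fires.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Purely illustrative storage, not the DotNext implementation: it shows how an
// exclusive lock held across awaited callbacks would make an otherwise cheap
// in-memory ApplyAsync slow enough to hit the request timeout.
sealed class IllustrativeConfigStorage
{
    private readonly SemaphoreSlim accessLock = new(1, 1);

    // Made-up callback name; the real storage may expose something different.
    public Func<CancellationToken, ValueTask>? ConfigurationApplied { get; set; }

    public async ValueTask ApplyAsync(CancellationToken token)
    {
        await accessLock.WaitAsync(token).ConfigureAwait(false);
        try
        {
            // The in-memory bookkeeping itself is cheap, but if the subscriber
            // below is slow (disk, network, blocked thread), the lock stays held
            // and every concurrent ApplyAsync call queues behind it until the
            // caller's CancellationToken fires.
            if (ConfigurationApplied is { } handler)
                await handler(token).ConfigureAwait(false);
        }
        finally
        {
            accessLock.Release();
        }
    }
}
```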
> You can also measure broadcast-time over a long period of time to ensure that the leader has enough time to send heartbeats.

Thanks for looking into it. I thought the stack trace was certainly weird, but it did not occur to me why that could be.
By the way, I was trying to collect the data, but early attempts did not reproduce the issue and then other work got in the way.
When I get back to this I will try to find better information on what's going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then resumed normal operation after that).
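Following the broadcast-time suggestion, this is roughly the long-running sampling I have in mind for when I get back to it. It is only a network-level proxy (ICMP round trips, not real AppendEntries latency), and the member addresses beyond the first one are placeholders:

```csharp
using System;
using System.Diagnostics;
using System.Net.NetworkInformation;
using System.Threading.Tasks;

// Rough proxy only: samples ICMP round trips to the other cluster members over
// a long window. It does not measure the real Raft broadcast time (which also
// includes AppendEntries processing), but it should reveal windows where the
// leader could not have reached a follower within the election timeout.
// Only 192.168.100.11 comes from the log above; the other addresses are placeholders.
string[] members = { "192.168.100.11", "192.168.100.12", "192.168.100.13" };
var interval = TimeSpan.FromSeconds(1);
var duration = TimeSpan.FromHours(24);

using var ping = new Ping();
var clock = Stopwatch.StartNew();

while (clock.Elapsed < duration)
{
    foreach (var member in members)
    {
        var reply = await ping.SendPingAsync(member, 1000);
        Console.WriteLine($"{DateTime.UtcNow:O} {member} {reply.Status} {reply.RoundtripTime} ms");
    }

    await Task.Delay(interval);
}
```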