very frequent applying configuration + occasional timeout

See original GitHub issue

Turning on Debug level logging shows that applying a configuration is done all the time even though there are no changes.

Here is how it looks: [image: debug log screenshot]

As seen above, the fingerprint is always the same (as there are no changes) and applyConfig is true. That should always hit the first case below, and I wonder whether that is intended, since the config is already correct.

    var fingerprint = (ConfigurationStorage.ProposedConfiguration ?? ConfigurationStorage.ActiveConfiguration).Fingerprint;
    Logger.IncomingConfiguration(fingerprint, config.Fingerprint, applyConfig);
    switch ((config.Fingerprint == fingerprint, applyConfig))
    {
        case (true, true):
            await ConfigurationStorage.ApplyAsync(token).ConfigureAwait(false);
            break;
        case (true, false):
            break;
        case (false, false):
            await ConfigurationStorage.ProposeAsync(config).ConfigureAwait(false);
            break;
        case (false, true):
            result = result with { Value = false };
            break;
    }
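
The four branches can be modeled in isolation to see which action each combination selects. This is only a minimal sketch, not the DotNext implementation: `ConfigAction`, `ConfigDecision.Decide`, and the `long` fingerprint values are hypothetical names and values chosen for illustration.

```csharp
using System;

// Hypothetical labels for the four outcomes of the switch above.
enum ConfigAction { Apply, Nothing, Propose, Reject }

static class ConfigDecision
{
    // Maps (fingerprints match, applyConfig) to an action, mirroring the four cases.
    public static ConfigAction Decide(long localFingerprint, long incomingFingerprint, bool applyConfig)
        => (incomingFingerprint == localFingerprint, applyConfig) switch
        {
            (true, true) => ConfigAction.Apply,     // same config, leader asks to apply
            (true, false) => ConfigAction.Nothing,  // same config, nothing to do
            (false, false) => ConfigAction.Propose, // new config, store as proposed
            (false, true) => ConfigAction.Reject,   // asked to apply an unknown config
        };
}

static class DecisionDemo
{
    static void Main()
    {
        // Unchanged fingerprint with applyConfig = true always selects Apply.
        Console.WriteLine(ConfigDecision.Decide(42, 42, true)); // prints "Apply"
    }
}
```

In this model, an unchanged fingerprint combined with `applyConfig == true` selects `Apply` on every call, which matches the behavior described above.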

The occasional timeouts are why I found the above. A test cluster has hit 3 timeouts in 8 days (so it is not fast to reproduce). Warning-level logging gives a single entry when the timeout occurs:

2023-06-20 00:21:57.7202|WARN|DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer|Timeout occurred while processing request from 192.168.100.11:40896|System.Threading.Tasks.TaskCanceledException: A task was canceled.                                                                                                                                                                                                                
at DotNext.Net.Cluster.Consensus.Raft.Membership.ClusterConfigurationStorage`1.DotNext.Net.Cluster.Consensus.Raft.Membership.IClusterConfigurationStorage.ApplyAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Membership/ClusterConfigurationStorage.cs:line 144                                                                                                                 
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.AppendEntriesAsync[TEntry](ClusterMemberId sender, Int64 senderTerm, ILogEntryProducer`1 entries, Int64 prevLogIndex, Int64 prevLogTerm, Int64 commitIndex, IClusterConfiguration config, Boolean applyConfig, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 629                                              
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)                                        
at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ConnectionOriented.Server.AppendEntriesAsync(ProtocolStream protocol, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/TransportServices/ConnectionOriented/Server.cs:line 121                                                                                                                                                   
at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)                                                 
at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpServer.HandleConnection(Socket remoteClient) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/Tcp/TcpServer.cs:line 135         

Since I am running with a fixed member configuration loaded at start, it is extra odd to see timeouts while applying the configuration during AppendEntries: the configuration is always the same, so there should be nothing to apply, right?

The other thing I find odd is that I thought InMemoryClusterConfigurationStorage only holds the config in memory, so how or why would ApplyAsync take long enough to trigger a timeout? Is it possible there is some odd locking issue in this area of the implementation, perhaps related to events raised while the lock is held?
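
To illustrate the locking hypothesis, here is a minimal sketch of how a purely in-memory operation can still end in a cancellation when it waits for a lock with a token that times out. It assumes nothing about how `ClusterConfigurationStorage` actually synchronizes: `InMemoryStore`, its `SemaphoreSlim` gate, `HoldAsync`, and the 100 ms timeout are all hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical in-memory store; NOT the DotNext implementation.
sealed class InMemoryStore
{
    private readonly SemaphoreSlim gate = new(1, 1);
    private string active = "initial";

    public string Active => active;

    // Holds the lock while a (possibly slow) callback runs, e.g. an event handler.
    public async Task HoldAsync(Func<Task> whileLocked)
    {
        await gate.WaitAsync().ConfigureAwait(false);
        try { await whileLocked().ConfigureAwait(false); }
        finally { gate.Release(); }
    }

    // The work itself is a trivial assignment; only the lock wait observes the token.
    public async Task ApplyAsync(string config, CancellationToken token)
    {
        await gate.WaitAsync(token).ConfigureAwait(false);
        try { active = config; }
        finally { gate.Release(); }
    }
}

static class LockDemo
{
    // Returns true when ApplyAsync is canceled while waiting for the lock.
    public static async Task<bool> ApplyTimesOutAsync()
    {
        var store = new InMemoryStore();
        var release = new TaskCompletionSource();
        var holder = store.HoldAsync(() => release.Task); // lock is now held

        using var timeout = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
        try
        {
            await store.ApplyAsync("new", timeout.Token);
            return false;
        }
        catch (OperationCanceledException)
        {
            return true;
        }
        finally
        {
            release.SetResult(); // let the holder finish and release the lock
            await holder;
        }
    }

    static async Task Main() => Console.WriteLine(await ApplyTimesOutAsync()); // prints "True"
}
```

Whether anything like this happens in the real storage class is not established here; the sketch only shows that the time spent waiting for a lock, rather than the in-memory work itself, can be what exceeds the timeout.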

Issue Analytics

  • State: closed
  • Created 3 months ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
sakno commented, Jul 10, 2023

You can also measure broadcast time over a long period to ensure that the leader has enough time to send heartbeats.
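
One way to collect such measurements is to time each heartbeat round and keep running statistics. The sketch below is an assumption-laden illustration, not a DotNext API: `BroadcastTimeTracker` is a hypothetical helper, the heartbeat round is passed in as a delegate, and the 150 ms election timeout used for comparison is an arbitrary example value. In Raft, the broadcast time is expected to be much smaller than the election timeout.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper that accumulates broadcast-time samples.
sealed class BroadcastTimeTracker
{
    public int Samples { get; private set; }
    public TimeSpan Max { get; private set; }
    public TimeSpan Total { get; private set; }

    public TimeSpan Average
        => Samples == 0 ? TimeSpan.Zero : TimeSpan.FromTicks(Total.Ticks / Samples);

    // Adds one measured round to the running statistics.
    public void Record(TimeSpan elapsed)
    {
        Samples++;
        Total += elapsed;
        if (elapsed > Max) Max = elapsed;
    }

    // Times one heartbeat round supplied by the caller and records it.
    public async Task MeasureAsync(Func<Task> heartbeatRound)
    {
        var watch = Stopwatch.StartNew();
        await heartbeatRound().ConfigureAwait(false);
        watch.Stop();
        Record(watch.Elapsed);
    }

    // True when the slowest observed round still fits within the given budget.
    public bool FitsWithin(TimeSpan budget) => Max < budget;
}

static class BroadcastDemo
{
    static void Main()
    {
        var tracker = new BroadcastTimeTracker();
        tracker.Record(TimeSpan.FromMilliseconds(10));
        tracker.Record(TimeSpan.FromMilliseconds(30));
        tracker.Record(TimeSpan.FromMilliseconds(20));

        Console.WriteLine(tracker.Max.TotalMilliseconds);     // prints "30"
        Console.WriteLine(tracker.Average.TotalMilliseconds); // prints "20"

        // Example value only; use the election timeout configured for the cluster.
        Console.WriteLine(tracker.FitsWithin(TimeSpan.FromMilliseconds(150))); // prints "True"
    }
}
```

Tracking the maximum as well as the average matters here, because a rare slow round is exactly what would produce an occasional timeout.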

1 reaction
freddyrios commented, Jul 10, 2023

Thanks for looking into it. I thought the stack was certainly weird, but it did not occur to me why that could be.

By the way, I was trying to collect the data but early attempts were not reproducing again and then other stuff got in the way.

When I get back to this I will try to find some better info on what’s going on, as AppendEntries in that environment should not have been hitting the large timeouts in any case. It might even be related to an issue we saw in a different cluster with smaller timeouts, where there was a 15-second window in which all AppendEntries retries for a specific node kept timing out (and then normal operation resumed).
