Some nodes stop participating in leader elections with TCP transport


Hi, I tried to run the example/RaftNode project and saw that some nodes can stop participating in leader elections. What can lead to this behavior?

.NET Core SDK (reflecting any global.json):
 Version: 3.1.401
 Commit: 5b6f5e5005

Runtime Environment:
 OS Name: Windows
 OS Version: 10.0.17763
 OS Platform: Windows
 RID: win10-x64
 Base Path: C:\Program Files\dotnet\sdk\3.1.401\

Host (useful for support):
 Version: 3.1.7
 Commit: fcfdef8d6b

The host also has the Hyper-V role installed, if relevant.

node1: cmd: RaftNode.exe tcp 3262

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 911. Election timeout 00:00:00.2750000
Consensus cannot be reached
Term of local cluster member is 911. Election timeout 00:00:00.2750000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 913. Election timeout 00:00:00.2330000
Consensus cannot be reached
Term of local cluster member is 913. Election timeout 00:00:00.2330000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 919. Election timeout 00:00:00.2880000

node2: cmd: RaftNode.exe tcp 3263

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 911. Election timeout 00:00:00.2720000
Consensus cannot be reached
Term of local cluster member is 911. Election timeout 00:00:00.2720000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 913. Election timeout 00:00:00.2720000
Consensus cannot be reached
Term of local cluster member is 913. Election timeout 00:00:00.2720000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 919. Election timeout 00:00:00.1820000

node3 (problem node): cmd: RaftNode.exe tcp 3264

Consensus cannot be reached
Term of local cluster member is 465. Election timeout 00:00:00.1930000
New cluster leader is elected. Leader address is 127.0.0.1:3263
Term of local cluster member is 469. Election timeout 00:00:00.2030000
Consensus cannot be reached
Term of local cluster member is 469. Election timeout 00:00:00.2030000
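
To make the divergence between the nodes easier to see, the console output above can be summarized per node. The following is only a minimal sketch in Python, not part of the RaftNode example: the `summarize` helper and the `LOGS` dictionary are hypothetical names, and the regular expressions assume the exact line formats shown in the logs above. The sample text is an excerpt of the node1 and node3 output from this report.

```python
import re

# Patterns assume the console line formats shown in the logs above.
TERM_RE = re.compile(r"Term of local cluster member is (\d+)")
LEADER_RE = re.compile(r"Leader address is (\S+)")
FAILED_LINE = "Consensus cannot be reached"


def summarize(log_text):
    """Return the highest term, the last reported leader, and the number of failed elections."""
    max_term = None
    last_leader = None
    failed = 0
    for raw in log_text.splitlines():
        line = raw.strip()
        if line == FAILED_LINE:
            failed += 1
            continue
        term_match = TERM_RE.search(line)
        if term_match:
            term = int(term_match.group(1))
            max_term = term if max_term is None else max(max_term, term)
            continue
        leader_match = LEADER_RE.search(line)
        if leader_match:
            last_leader = leader_match.group(1)
    return {"max_term": max_term, "last_leader": last_leader, "failed_elections": failed}


# Excerpts copied from the node1 and node3 output in this report.
LOGS = {
    "node1": """\
Consensus cannot be reached
Term of local cluster member is 913. Election timeout 00:00:00.2330000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 919. Election timeout 00:00:00.2880000
""",
    "node3": """\
Consensus cannot be reached
Term of local cluster member is 465. Election timeout 00:00:00.1930000
New cluster leader is elected. Leader address is 127.0.0.1:3263
Term of local cluster member is 469. Election timeout 00:00:00.2030000
Consensus cannot be reached
Term of local cluster member is 469. Election timeout 00:00:00.2030000
""",
}

if __name__ == "__main__":
    for node, text in LOGS.items():
        print(node, summarize(text))
```

On these excerpts the summary puts node1 at term 919 with 127.0.0.1:3264 as the last reported leader, while node3 is still at term 469 and last saw 127.0.0.1:3263 as leader. That gap is the symptom described in this report.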

In fact, I have seen the same behavior when trying to use RaftCluster with TCP transport in our project and cannot find the problem. So I tried to reproduce the problem with the example project.

Another problem is that after a while, the worker node stops with an assertion error:

Process terminated. Assertion failed.
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ClientExchange.ProcessInboundMessageAsync(PacketHeaders headers, ReadOnlyMemory`1 payload, EndPoint sender, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ClientExchange.cs:line 70
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpClient.ClientNetworkStream.Exchange(IExchange exchange, Memory`1 buffer, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpClient.cs:line 59
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpClient.ClientNetworkStream.Exchange(IExchange exchange, Memory`1 buffer, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpClient.Enqueue(IExchange exchange, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\Tcp\TcpClient.cs:line 139
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.Tcp.TcpClient.Enqueue(IExchange exchange, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ExchangePeer.SendAsync[TResult,TExchange](TExchange exchange, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ExchangePeer.cs:line 48
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ExchangePeer.SendAsync[TResult,TExchange](TExchange exchange, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.TransportServices.ExchangePeer.VoteAsync(Int64 term, Int64 lastLogIndex, Int64 lastLogTerm, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\TransportServices\ExchangePeer.cs:line 68
   at DotNext.Net.Cluster.Consensus.Raft.RaftClusterMember.DotNext.Net.Cluster.Consensus.Raft.IRaftClusterMember.VoteAsync(Int64 term, Int64 lastLogIndex, Int64 lastLogTerm, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftClusterMember.cs:line 78
   at DotNext.Net.Cluster.Consensus.Raft.CandidateState.VotingState.VoteAsync(IRaftClusterMember voter, Int64 term, IAuditTrail`1 auditTrail, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\CandidateState.cs:line 35
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.CandidateState.VotingState.VoteAsync(IRaftClusterMember voter, Int64 term, IAuditTrail`1 auditTrail, CancellationToken token)
   at DotNext.Net.Cluster.Consensus.Raft.CandidateState.VotingState..ctor(IRaftClusterMember voter, Int64 term, IAuditTrail`1 auditTrail, CancellationToken token) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\CandidateState.cs:line 55
   at DotNext.Net.Cluster.Consensus.Raft.CandidateState.StartVoting(Int32 timeout, IAuditTrail`1 auditTrail) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\CandidateState.cs:line 131
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState() in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\RaftCluster.cs:line 661
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState()
   at DotNext.Net.Cluster.Consensus.Raft.FollowerState.Track(TimeSpan timeout, IAsyncEvent refreshEvent, Action candidateState, CancellationToken[] tokens) in g:\work\VSNET\github\dotNext\src\cluster\DotNext.Net.Cluster\Net\Cluster\Consensus\Raft\FollowerState.cs:line 34
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetResult(TResult result)
   at DotNext.Threading.QueuedSynchronizer.WaitAsync(WaitNode node, TimeSpan timeout, CancellationToken token) in g:\work\VSNET\github\dotNext\src\DotNext.Threading\Threading\QueuedSynchronizer.cs:line 121
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(IAsyncStateMachineBox box, Boolean allowInlining)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task`1.TrySetResult(TResult result)
   at System.Threading.Tasks.TaskFactory.CompleteOnInvokePromise.Invoke(Task completingTask)
   at System.Threading.Tasks.Task.RunContinuations(Object continuationObject)
   at System.Threading.Tasks.Task.TrySetResult()
   at System.Threading.Tasks.Task.DelayPromise.CompleteTimedOut()
   at System.Threading.TimerQueueTimer.CallCallback(Boolean isThreadPool)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10

Top GitHub Comments

sakno commented, Aug 16, 2020

Patched version is now published on NuGet.

Eykhler commented, Aug 16, 2020

My bad. I’ve tested the develop branch from the fork. Now, after an hour of testing changes from the original project, I can’t reproduce the bug. Thank you for the quick fix!
