AlluxioMaster leader switches unexpectedly after a manual leader switch with the "quorum elect" command
Alluxio Version: v2.6.2
Describe the bug
With the embedded journal system, running 'fsadmin journal quorum elect -address <xxxx>' changes the node priorities and does not reset them. As a result, if the elected node goes offline temporarily and comes back online later, the leader switches again immediately after that node is back online.
To Reproduce
- Manually switch the leader to NodeA:
./bin/alluxio fsadmin journal quorum elect -address <node_a>:19200
- Check the priority info of the nodes:
./bin/alluxio fsadmin journal quorum info -domain MASTER
- Take down the AlluxioMaster process on NodeA and wait for the leader role to fail over to one of the two remaining nodes.
- Once the two remaining nodes reach a final state of one LEADER and one FOLLOWER, bring NodeA online again.
- A leader switch happens again once NodeA is online.
Expected behavior
No leader switch should happen in step 5; the switch observed there is unexpected.
Urgency Describe the impact and urgency of the bug.
Additional context
This appears to be a side effect of fsadmin journal quorum elect -address <xxxx>, which sets priorities on the master nodes but doesn't clear them after the operation is done.
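The side effect described above can be illustrated with a small toy model. This is only a sketch and not Alluxio or Apache Ratis code: the priority values, the `settle_leader` helper, and the rule that a leader yields to an online peer with a strictly higher priority are all assumptions made for illustration.

```python
# Toy model only: none of these names or values come from Alluxio.
ELECT_PRIORITY = 2    # hypothetical priority given to the elected target
DEFAULT_PRIORITY = 1  # hypothetical priority of every other node


def settle_leader(current_leader, priorities, online):
    """Return the leader once the quorum settles.

    Assumed rule: the current leader keeps its role unless it is offline
    or an online peer has a strictly higher priority.
    """
    candidates = [n for n in priorities if n in online]
    best = max(candidates, key=lambda n: priorities[n])
    if current_leader in online and priorities[current_leader] >= priorities[best]:
        return current_leader
    return best


all_nodes = {"node_a", "node_b", "node_c"}

# Step 1: "quorum elect" raises node_a's priority and leaves it raised.
priorities = {
    "node_a": ELECT_PRIORITY,
    "node_b": DEFAULT_PRIORITY,
    "node_c": DEFAULT_PRIORITY,
}
after_elect = settle_leader("node_b", priorities, all_nodes)

# Steps 3-4: node_a goes down and one of the remaining nodes takes over.
after_failover = settle_leader(after_elect, priorities, {"node_b", "node_c"})

# Step 5: node_a returns with its stale priority and takes leadership back.
after_rejoin = settle_leader(after_failover, priorities, all_nodes)

print(after_elect, after_failover, after_rejoin)
```

In this model the second switch happens only because the priority map is never restored; with equal priorities the rule above would leave the failover leader in place.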
Suggested fix:
- Restore or reset the priority info by calling resetPriorities in QuorumElectCommand, or by calling resetPriorities in RaftJournalSystem::transferLeadership, after leadership is transferred.
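A minimal sketch of the suggested fix, under the same kind of simplification: `ToyQuorum`, its methods, and the priority values are hypothetical stand-ins, not the real QuorumElectCommand or RaftJournalSystem code. It only shows the ordering of operations (transfer first, then reset) and its effect when the elected node later rejoins.

```python
class ToyQuorum:
    """Toy stand-in for an embedded-journal quorum; not Alluxio code."""

    DEFAULT_PRIORITY = 1  # hypothetical value
    ELECT_PRIORITY = 2    # hypothetical value

    def __init__(self, nodes, leader):
        self.priorities = {n: self.DEFAULT_PRIORITY for n in nodes}
        self.online = set(nodes)
        self.leader = leader

    def _settle(self):
        # Assumed rule: the leader changes only if it is offline or an
        # online peer has a strictly higher priority.
        candidates = [n for n in self.priorities if n in self.online]
        best = max(candidates, key=lambda n: self.priorities[n])
        leader_gone = self.leader not in self.online
        if leader_gone or self.priorities[self.leader] < self.priorities[best]:
            self.leader = best

    def reset_priorities(self):
        for node in self.priorities:
            self.priorities[node] = self.DEFAULT_PRIORITY

    def transfer_leadership(self, target, reset_after=True):
        self.priorities[target] = self.ELECT_PRIORITY
        self._settle()
        # The suggested fix: clear the priorities once the target leads.
        if reset_after and self.leader == target:
            self.reset_priorities()

    def set_online(self, node, is_online):
        if is_online:
            self.online.add(node)
        else:
            self.online.discard(node)
        self._settle()


def run_scenario(reset_after):
    quorum = ToyQuorum(["node_a", "node_b", "node_c"], leader="node_b")
    quorum.transfer_leadership("node_a", reset_after=reset_after)
    quorum.set_online("node_a", False)  # node_a goes down, failover happens
    quorum.set_online("node_a", True)   # node_a comes back
    return quorum.leader


leader_without_reset = run_scenario(reset_after=False)
leader_with_reset = run_scenario(reset_after=True)
print(leader_without_reset, leader_with_reset)
```

Without the reset, the returning node reclaims leadership; with it, the failover leader keeps its role after the node rejoins.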
Issue Analytics
- State:
- Created 2 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
I was able to reproduce and confirm this behavior. Looking into it.
@jenoudet @kevincai The change of node priority does affect the result of the next election. In the original design of the quorum elect command, the resetPriority command is called after the switch is completed. But I think the resetPriority command also needs to monitor whether the priorities were reset successfully. If this whole process is implemented in resetPriorities in QuorumElectCommand or in RaftJournalSystem::transferLeadership, the switching time may be too long. A compromise is to reset the priorities manually after the switch is completed. This has been implemented in #14031, for reference. Thanks.
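To make the trade-off in this comment concrete, here is a hedged sketch of what "reset and monitor" could look like as a polling loop. `reset_and_verify` and the two callables passed to it are hypothetical hooks, not Alluxio APIs; the fake quorum below only simulates a reset that becomes visible after a delay.

```python
import time


def reset_and_verify(reset, read_priorities, expected, timeout_s=5.0, poll_s=0.1):
    """Call reset(), then poll until every node reports `expected`.

    Returns the seconds spent waiting and raises TimeoutError if the reset
    is not observed in time. Both callables are hypothetical hooks.
    """
    start = time.monotonic()
    reset()
    while True:
        if all(p == expected for p in read_priorities().values()):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError("priorities were not reset in time")
        time.sleep(poll_s)


# Fake quorum state: the reset becomes visible only on the third read.
state = {
    "priorities": {"node_a": 2, "node_b": 1, "node_c": 1},
    "stale_reads_left": 0,
    "reads": 0,
}


def fake_reset():
    state["stale_reads_left"] = 2


def fake_read():
    state["reads"] += 1
    if state["stale_reads_left"] > 0:
        state["stale_reads_left"] -= 1
    else:
        state["priorities"] = {n: 1 for n in state["priorities"]}
    return dict(state["priorities"])


elapsed = reset_and_verify(fake_reset, fake_read, expected=1, timeout_s=1.0, poll_s=0.01)
print(state["reads"], state["priorities"])
```

Any time spent in the polling loop would be added to the elect command if the verification ran inline, which is the cost the comment weighs against resetting the priorities manually afterwards.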