AlluxioMaster leader switches unexpectedly after a manual leader switch with the "quorum elect" command

Alluxio Version: v2.6.2

Describe the bug With the embedded journal system, running ‘fsadmin journal quorum elect -address <xxxx>’ changes the node priorities and does not reset them. As a result, if the elected node goes offline temporarily and comes back online later, the leader switches again as soon as that node is back.

To Reproduce

  1. Manually switch the leader to NodeA: ./bin/alluxio fsadmin journal quorum elect -address <node_a>:19200
  2. Check the priority info of the nodes with ./bin/alluxio fsadmin journal quorum info -domain MASTER
  3. Take down the AlluxioMaster process on NodeA and wait for the leader role to fail over to one of the two remaining nodes.
  4. Once the two remaining nodes reach a final state of one LEADER and one FOLLOWER, bring NodeA online again.
  5. A leader switch happens again as soon as NodeA is online.

Expected behavior No leader switch should happen in step 5; the switch observed there is unexpected.
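
To make the sequence easier to reason about, here is a minimal toy model of the reproduction steps. It is not Alluxio or Ratis code. It assumes a simplified rule (the online node with the highest priority leads, and the incumbent keeps the role on a tie), and the priority values 0, 1 and 2 as well as the names `settle` and `reproduce` are made up for illustration.

```python
# Toy model only: NOT Alluxio or Ratis code. Assumed rule: the online node with
# the highest priority leads, and the incumbent keeps the role on a tie.

def settle(leader, priority, online):
    """Return the leader after applying the assumed priority rule."""
    live = sorted(n for n in priority if n in online)
    best = max(priority[n] for n in live)
    if leader in live and priority[leader] == best:
        return leader  # incumbent stays when nobody outranks it
    return next(n for n in live if priority[n] == best)


def reproduce():
    """Replay the reproduction steps and return the sequence of leaders."""
    priority = {"NodeA": 0, "NodeB": 0, "NodeC": 0}
    online = set(priority)
    leader = "NodeB"
    history = [leader]

    def step():
        nonlocal leader
        new_leader = settle(leader, priority, online)
        if new_leader != leader:
            leader = new_leader
            history.append(leader)

    # Step 1: "quorum elect" raises the target's priority (values are assumed).
    priority.update({"NodeA": 2, "NodeB": 1, "NodeC": 1})
    step()
    # Step 3: NodeA goes down and the leader role fails over.
    online.discard("NodeA")
    step()
    # Steps 4-5: NodeA returns, still carrying its stale higher priority.
    online.add("NodeA")
    step()
    return history


print(reproduce())  # prints ['NodeB', 'NodeA', 'NodeB', 'NodeA']
```

The last entry in the history is the unexpected switch: under this assumed rule, NodeA takes the role back only because its priority was never cleared.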

Additional context

This is likely a side effect of fsadmin journal quorum elect -address <xxxx>, which sets priorities on the master nodes but does not clear them after the operation is done.

Suggested fix:

  • Restore or reset the priority info by calling resetPriorities in QuorumElectCommand, or by calling resetPriorities in RaftJournalSystem::transferLeadership after leadership is transferred.
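
A rough sketch of what the suggested fix changes, again as a toy model rather than Alluxio or Ratis code. `ToyQuorum`, `transfer_leadership`, `set_online` and `run` are hypothetical names, the priority values are invented, and the election rule (highest-priority online node leads, incumbent wins ties) is an assumption. The only point is the effect of resetting priorities once the transfer is confirmed.

```python
# Toy model only: NOT Alluxio or Ratis code. Assumed rule: the online node with
# the highest priority leads, and the incumbent keeps the role on a tie.

class ToyQuorum:
    def __init__(self, nodes, leader):
        self.priority = {n: 0 for n in nodes}
        self.online = set(nodes)
        self.leader = leader
        self.switches = 0

    def _settle(self):
        live = sorted(n for n in self.priority if n in self.online)
        best = max(self.priority[n] for n in live)
        if self.leader in live and self.priority[self.leader] == best:
            return  # incumbent stays when nobody outranks it
        self.leader = next(n for n in live if self.priority[n] == best)
        self.switches += 1

    def transfer_leadership(self, target, reset=True):
        # Hypothetical stand-in for the elect command: bias the election...
        for n in self.priority:
            self.priority[n] = 2 if n == target else 1
        self._settle()
        if reset and self.leader == target:
            # ...then restore equal priorities once the transfer is confirmed.
            self.priority = dict.fromkeys(self.priority, 0)

    def set_online(self, node, up):
        if up:
            self.online.add(node)
        else:
            self.online.discard(node)
        self._settle()


def run(reset):
    """Elect NodeA, bounce it, and report the final leader and switch count."""
    q = ToyQuorum(["NodeA", "NodeB", "NodeC"], leader="NodeB")
    q.transfer_leadership("NodeA", reset=reset)
    q.set_online("NodeA", False)
    q.set_online("NodeA", True)
    return q.leader, q.switches


print(run(reset=False))  # prints ('NodeA', 3)
print(run(reset=True))   # prints ('NodeB', 2)
```

With the reset in place, the model shows only the two expected switches (the manual transfer and the failover), and NodeA rejoins as a follower.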

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
jenoudet commented, Oct 11, 2021

I was able to reproduce and confirm this behavior. Looking into it.

1 reaction
codings-dan commented, Oct 12, 2021

@jenoudet @kevincai Changing node priorities will indeed affect the result of the next election. In the original design of the quorum elect command, the resetPriority command is called after the switch completes. But I think the resetPriority command also needs to check whether the priorities were reset successfully. If this whole process is implemented in resetPriorities in QuorumElectCommand or in RaftJournalSystem::transferLeadership, the switch may take too long. A compromise is to reset the priorities manually after the switch completes. This has been implemented in #14031, for reference, thx.
