
Leader election and dqlite errors when recovering nodes in HA cluster


Hello,

We have an HA cluster set up with three nodes, each running version 1.21.7:

NAME                    STATUS   ROLES    AGE   VERSION                    INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
masterA                 Ready    <none>   37m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterB                 Ready    <none>   32m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
masterC                 Ready    <none>   42m   v1.21.7-3+7700880a5c71e2   X.X.X.X          <none>        Ubuntu 18.04.6 LTS   5.4.0-79-generic   containerd://1.4.4
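For reference, the listing above is the kind of output produced by kubectl get nodes -o wide. The checks we normally run (assuming a stock MicroK8s snap install; both commands are standard MicroK8s CLI) are:

microk8s status --wait-ready        # reports "high-availability: yes" and the dqlite datastore nodes
microk8s kubectl get nodes -o wide  # produces the node table shown above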

We came across an issue when two nodes, masterA and masterB, were removed ungracefully and shut down. The elected leader node was masterA. The following errors occurred around the same time on the remaining node, masterC:

leaderelection.go:325] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=15s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}: context canceled
apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
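My understanding of these messages is that kube-scheduler is failing to renew its leader-election Lease through the local apiserver, which itself stalls once the dqlite datastore behind it loses quorum. For anyone who wants to look at the lease state directly, a minimal check (plain kubectl, nothing MicroK8s-specific) is:

# holderIdentity and renewTime show who currently holds scheduler leadership
# and when the lease was last renewed
microk8s kubectl -n kube-system get lease kube-scheduler -o yaml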

At this point, the elected leader node remained masterA, which was still powered off; however, when we powered on masterB, it failed to start the kubelite service:

microk8s.daemon-kubelite[8542]: Error: start node: raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch
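In case it helps others, the segment files named in this error are the on-disk dqlite data. Where to look, assuming the default MicroK8s snap layout, is:

ls -l /var/snap/microk8s/current/var/kubernetes/backend/   # dqlite segment files plus cluster.yaml and info.yaml
journalctl -u snap.microk8s.daemon-kubelite -b             # full kubelite output, including the raft_start() error above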

Could the segment be corrupt, or would this suggest that it cannot sync the dqlite files to the elected leader masterA, which is still unavailable? If so, is there a way we can validate the checksum? In order to recover masterB, I had to delete the mentioned dqlite file and restart the kubelite service. Once masterB and masterC were available, a new leader node was elected (masterB) and we were able to recover the cluster.
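For the record, the recovery on masterB amounted to roughly the following. This is just what worked for us, not an official procedure; the paths assume the default snap layout and the segment name is specific to our cluster:

sudo snap stop microk8s.daemon-kubelite
cd /var/snap/microk8s/current/var/kubernetes/backend/
sudo cp 0000000001324915-0000000001325281 /root/    # keep a copy before deleting
sudo rm 0000000001324915-0000000001325281
sudo snap start microk8s.daemon-kubelite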

Checking the HA documentation, with only a single node available, would this render the cluster inoperable? Essentially, would we need more than one node available at any time?
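My own back-of-the-envelope reading of the Raft/dqlite model is that the datastore needs a majority of voters:

quorum = floor(n / 2) + 1 = floor(3 / 2) + 1 = 2

so with only masterC up (1 of 3 voters) the quorum of 2 cannot be met, and the datastore, and therefore the apiserver, cannot make progress until a second node returns.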

There were a few other suggestions, such as increasing the following arguments in the kube-scheduler and kube-controller-manager (source); a sketch of where these would be applied in MicroK8s follows the list:

--leader-elect-lease-duration=60s
--leader-elect-renew-deadline=40s
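To be explicit about where these would go: MicroK8s reads component arguments from per-service files, so applying the suggestion looks roughly like the following (default snap paths assumed; the values are the suggested ones, not something we have validated):

echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-scheduler
echo '--leader-elect-lease-duration=60s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--leader-elect-renew-deadline=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
sudo snap restart microk8s.daemon-kubelite   # restart kubelite so the new arguments are picked up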

A number of comments mentioned that they had the same issue with microk8s v1.21. The last potential cause was a “resource crunch or network issue” mentioned here. We have not yet been able to replicate the issue but would appreciate it if anyone could shed some light on this.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

6 reactions
MathieuBordere commented, Jan 10, 2022

@bc185174, the error raft_start(): io: load closed segment 0000000001324915-0000000001325281: entries batch 52 starting at byte 1041968: data checksum mismatch indicates some form of data corruption. This probably happened because of the unclean way the node was taken down. Maybe @MathieuBordere knows if there are any plans to perform some kind of (semi) automated “fsck” on the data and recover from such cases.

It’s not planned immediately but was already discussed, and imo is useful to add. Will try to do it within a reasonable timeframe.

0 reactions
stale[bot] commented, Dec 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
