Gracefully recover from an out-of-date or corrupted dqlite member
Hello,
As always, thanks for your hard work on microk8s!
I have been running microk8s in a 3-node HA test bed on the latest snap (from latest/edge), and after a few days of operation one of the dqlite members has fallen out of sync. We have `dbctl` for taking backups and restoring DB state, which is great; however, I haven't been able to pinpoint exactly why or when a node is going to fall out of sync, so I can't be sure under exactly which circumstances this happens.

I have set up various telemetry tools, and I can tell that there is no significant IO, memory or CPU pressure on the nodes; it looks as though the cluster just collapses. I have attached a `microk8s inspect` tarball from an impacted node. The issue appears to point solely back to dqlite, as the apiserver and kubelet no longer seem to be able to communicate with it, causing all manner of failures. I'm happy to dig into this separately if it occurs again, and will raise a separate issue, hopefully with more data about what caused the initial failure.
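Since `dbctl` is what we lean on for backups, here is a minimal sketch of how one might script periodic backups around it. It assumes the `microk8s dbctl backup` subcommand accepts an output path; the destination directory and file naming here are made up for illustration, so check `microk8s dbctl --help` on your snap revision before relying on anything like this.

```python
#!/usr/bin/env python3
"""Illustrative wrapper around `microk8s dbctl backup`, not an official tool.

Assumption: the backup subcommand accepts an output path; flags and naming
may differ between snap revisions, so check `microk8s dbctl --help` first.
"""
import datetime
import subprocess
from pathlib import Path

backup_dir = Path("/var/backups/microk8s")  # hypothetical destination
backup_dir.mkdir(parents=True, exist_ok=True)

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
target = backup_dir / f"dqlite-{stamp}.tar.gz"

# Take the backup; raise if dbctl exits non-zero so cron/systemd notices.
subprocess.run(["microk8s", "dbctl", "backup", str(target)], check=True)
print(f"wrote {target}")
```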
However, the more interesting thing here, and the one I would like to focus on, is that a rolling restart of the cluster, done to try to get dqlite healthy again, failed. I have done this in the past when this issue has cropped up (the failure of a single node's dqlite killing the control plane), but this time it did not work. Each node would time out trying to connect to port 19001, and eventually the apiserver would fail to start with a `context cancelled` error, i.e. a timeout. An `strace` of the process shows an attempt to connect over the network to the cluster port of another node (let's call this node node1). Inspecting node1 shows it is crashing with SIGSEGV.

I dug through the DB directory on node1, and the modification dates on all files are from two days ago, whilst the files under `/var/snap/microk8s/current/var/kubernetes/backend` on the other two nodes (let's call them node2 and node3) both show newer data, with similar dqlite files in place. All nodes have the correct cluster keys and configuration per `info.yaml` and `cluster.yaml`; only the data differs. Clearing the data on node1 (the crashing node) and node3, and restoring it from the contents of the `/var/snap/microk8s/current/var/kubernetes/backend` directory on node2, making sure to preserve `info.yaml`, allowed me to "repair" the cluster and start it back up, after which all services were available again.
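To make the recovery concrete, here is a minimal sketch of the file shuffling I did on each broken node (node1 and node3). It assumes microk8s is stopped on the node and that a copy of node2's backend directory has already been transferred over (the `./healthy-backend` path is just a placeholder); my understanding is that `info.yaml` holds the local member's own identity, which is why it is the one file that has to stay local. Treat this as an illustration of my manual steps, not an official procedure.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the manual recovery steps, not an official tool.

Assumptions:
  - microk8s is stopped on this (broken) node
  - node2's backend directory has already been copied to ./healthy-backend
  - keeping the local info.yaml is enough to preserve this member's identity
"""
import shutil
from pathlib import Path

BACKEND = Path("/var/snap/microk8s/current/var/kubernetes/backend")
HEALTHY_COPY = Path("./healthy-backend")  # copied from node2 beforehand (placeholder)

# 1. Keep this node's own identity: info.yaml must not be overwritten.
local_info = (BACKEND / "info.yaml").read_bytes()

# 2. Clear out the stale/corrupted dqlite data on this node.
for entry in BACKEND.iterdir():
    if entry.is_dir():
        shutil.rmtree(entry)
    else:
        entry.unlink()

# 3. Restore the data files from the healthy node's copy.
for entry in HEALTHY_COPY.iterdir():
    target = BACKEND / entry.name
    if entry.is_dir():
        shutil.copytree(entry, target)
    else:
        shutil.copy2(entry, target)

# 4. Put the local node's info.yaml back so the member keeps its own ID/address.
(BACKEND / "info.yaml").write_bytes(local_info)

print("backend restored; restart microk8s on this node")
```

After doing this on node1 and node3 and restarting microk8s on each, the cluster formed again and the apiserver came back.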
I can make a copy of the DB available for reference; it was too large to upload to this issue. However, I think the feature that is missing here is some level of consistency checking on the database files. Crashing with SIGSEGV suggests to me that any form of corruption here will prevent the cluster from starting up after a cluster-wide failure, which I think is unexpected behaviour for an HA configuration. Additionally, some documentation for recovery would be really useful for operators of microk8s clusters, as restoring dqlite clusters is not documented anywhere I could find. I'd be happy to contribute that documentation if you point me in the right direction; I essentially had to reverse-engineer the way dqlite works in order to repair the cluster, so it's fairly fresh in my mind.
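As a strawman for what operator-level consistency checking could look like, here is a rough sketch along the lines of what I did by hand while diagnosing. It assumes copies of each node's backend directory have been pulled locally into `./node1`, `./node2`, `./node3` (hypothetical paths) and only verifies that the YAML metadata parses and that the `cluster.yaml` membership view agrees across nodes; real detection of corrupted data segments would presumably have to live inside dqlite itself.

```python
#!/usr/bin/env python3
"""Illustrative pre-flight check, not a real dqlite integrity checker.

Assumes copies of each node's backend directory sit locally under
./node1, ./node2, ./node3 (placeholder paths). It only verifies that the
metadata files parse and that cluster.yaml agrees across nodes; it cannot
detect corruption inside the data segments themselves.
"""
import sys
from pathlib import Path

import yaml  # PyYAML

NODES = ["node1", "node2", "node3"]

cluster_views = {}
for node in NODES:
    backend = Path(node)
    for name in ("info.yaml", "cluster.yaml"):
        path = backend / name
        if not path.exists():
            sys.exit(f"{node}: missing {name}")
        try:
            parsed = yaml.safe_load(path.read_text())
        except yaml.YAMLError as exc:
            sys.exit(f"{node}: {name} does not parse: {exc}")
        if name == "cluster.yaml":
            cluster_views[node] = parsed

# Every node should hold the same view of cluster membership.
reference = cluster_views[NODES[0]]
for node, view in cluster_views.items():
    if view != reference:
        print(f"warning: {node} cluster.yaml differs from {NODES[0]}")

print("metadata parsed on all nodes; membership views compared")
```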
Inspect tarball from one of the nodes failing to start: sig-segv-inspection.tar.gz
Hope this is all useful, happy to provide any further information, or access to the environment.
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 12 (1 by maintainers)
Top GitHub Comments
My understanding was that there were some additional corruptions that @freeekanayaka noticed in the dumps I provided, and the hope was that identifying those, and potentially being able to recover from them, would allow this issue to be closed out. I have still been seeing periodic corruption similar to what I initially reported, and have been following the recovery steps I posted on the discourse to correct it, but I think having microk8s/dqlite detect those corruptions and roll back problematic snapshots or checkpoints would be the ideal outcome here.
Sorry, my bad, I just noticed that there are 2 sets of tarballs that @devec0 provided, and the `backend` directory is in the second set. Should have looked more carefully at the beginning 😃 Looking now.