
Gracefully recover from out of date or corrupted dqlite member


Hello,

As always, thanks for your hard work on microk8s!

I have been running microk8s in a 3-node HA test bed on the latest snap (from latest/edge) and eventually, after a few days of operation, one of the dqlite members has fallen out of sync. We have dbctl for taking backups and restoring DB state, which is great; however, I haven’t been able to pinpoint exactly why or when a node is going to fall out of sync, so I can’t be sure under which circumstances this happens. I have set up various telemetry tools, and I can tell that there is no significant IO, memory or CPU pressure on the nodes; it looks as though the cluster just collapses. I have attached a microk8s inspect report from an impacted node. The issue appears to point solely back to dqlite, as the apiserver and kubelet no longer seem to be able to communicate with it, causing all manner of failures. I’m happy to dig into that failure separately if it occurs again, and will raise a separate issue, hopefully with more data about what caused it.
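
For anyone following along, this is roughly the tooling I mean; a minimal sketch, assuming a recent snap where the dbctl subcommand is available (exact arguments may differ between channels, so check `microk8s dbctl --help` on your release):

```bash
# Sketch only: confirm the exact usage with `microk8s dbctl --help` on your channel.

# Take a backup of the dqlite datastore (writes a tarball of the DB state).
microk8s dbctl backup

# Restore a previously taken backup onto a node.
microk8s dbctl restore <backup-file>

# Collect diagnostics from a misbehaving node; this is what produced the
# inspection tarball attached to this issue.
microk8s inspect
```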

However, the more interesting thing here, which I would like to focus on, is that a rolling restart of the cluster, done to try and get the dqlite cluster healthy again, failed. I’ve done this in the past when this issue has cropped up (the failure of a single node’s dqlite killing the control plane), but this time it didn’t work. Each node would time out trying to connect to port 19001, and eventually the apiserver would fail to start with a “context cancelled” error, i.e. a timeout. An strace of the process shows an attempt to connect over the network to the cluster port of another node (let’s call it node1). Inspecting node1 shows it is crashing with SIGSEGV. I dug through the DB directory, and the modification dates on all of node1’s files are about two days old, whilst the files under /var/snap/microk8s/current/var/kubernetes/backend on the other two nodes (node2 and node3) both show newer data with similar dqlite files in place. All nodes have the correct cluster keys and configuration per info.yaml and cluster.yaml; only the data differs. Clearing the data on node1 (the crashing node) and node3, and restoring it from the contents of the /var/snap/microk8s/current/var/kubernetes/backend directory on node2 while making sure to preserve each node’s info.yaml, allowed me to “repair” the cluster and start it back up, after which all services were available again (sketched below).
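
To make the repair concrete, here is a sketch of what I did. The node names and the backend path come from above; the /tmp filenames and the use of tar/scp to move the data around are just for illustration:

```bash
# Manual repair sketch: node2 holds the newest healthy data; node1 and node3
# are rebuilt from it while every microk8s service is stopped.
DBDIR=/var/snap/microk8s/current/var/kubernetes/backend

# On every node: stop microk8s so dqlite is not writing while files are copied.
microk8s stop

# On node2: archive the healthy backend directory and ship it to the others.
sudo tar -C "$DBDIR" -czf /tmp/dqlite-backend.tar.gz .
# scp /tmp/dqlite-backend.tar.gz node1:/tmp/   (and likewise to node3)

# On node1 and node3: keep the node's own identity file, wipe the stale data,
# unpack node2's data, then put the local info.yaml back.
sudo cp "$DBDIR/info.yaml" /tmp/info.yaml.keep
sudo rm -rf "$DBDIR"/*
sudo tar -C "$DBDIR" -xzf /tmp/dqlite-backend.tar.gz
sudo cp /tmp/info.yaml.keep "$DBDIR/info.yaml"

# On every node: bring the cluster back up.
microk8s start
```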

I can make a copy of the DB available for reference; it was too large to upload to this issue. However, I think the feature that is missing here is some level of consistency checking on the database files. Crashing with SIGSEGV suggests to me that any form of corruption here will prevent the cluster from starting up after a cluster-wide failure, which I think is unexpected behaviour for an HA configuration. Additionally, some documentation for recovery would be really useful for operators of microk8s clusters, as restoring dqlite clusters is not documented anywhere I could find. I’d be happy to contribute that documentation if you point me in the right direction; I essentially had to reverse-engineer the way dqlite works in order to repair the cluster, so it’s fairly fresh in my mind.
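
For what it’s worth, the check this boils down to in practice is something like the following (purely illustrative, not part of microk8s; it assumes SSH access to the nodes and GNU find, and just reports which member’s dqlite data has gone stale):

```bash
# Report the most recently modified file under the dqlite backend directory on
# each member; a node whose newest file is days older than the others is the
# likely stale/out-of-sync one (the "two days old" pattern described above).
DBDIR=/var/snap/microk8s/current/var/kubernetes/backend

for node in node1 node2 node3; do
  echo -n "$node: "
  # %T@ prints the modification time as a Unix timestamp (GNU find).
  ssh "$node" "sudo find $DBDIR -type f -printf '%T@ %p\n' | sort -nr | head -n1"
done
```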

Inspect report from one of the nodes failing to start: sig-segv-inspection.tar.gz

Hope this is all useful, happy to provide any further information, or access to the environment.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 12 (1 by maintainers)

Top GitHub Comments

1 reaction
devec0 commented, Oct 27, 2020

My understanding was that there were some additional corruptions that @freeekanayaka noticed in the dumps I provided, and the hope was that identifying those, and potentially being able to recover from them, would allow this to be closed out. I have still been seeing periodic corruption similar to what I initially reported, and have been following the recovery steps I posted on Discourse to correct it, but I think having microk8s/dqlite detect those corruptions and roll back problematic snapshots or checkpoints would be the ideal outcome here.

1 reaction
freeekanayaka commented, Sep 23, 2020

Sorry, my bad, I just noticed that there are 2 sets of tarballs that @devec0 provided, and the backend directory is in the second set. I should have looked more carefully at the beginning 😃 Looking now.


