
dqlite recovery missing documentation and robustness

See original GitHub issue

I was running an HA cluster (1.23/stable) of 3 nodes and lost 2 of them. While trying to recover from the remaining node, I found that several things are fishy.

The ultimate question: is there any way to connect to such a cluster database and recover it, without editing the code and making a custom build?

Subject of the issue:

  • no means to get the real cluster state, remove failed nodes from the cluster, or restart it as a single instance (the on-disk state can at least be inspected by hand; see the sketch after this list)
  • --debug is not helpful on k8s-dqlite; critical information is missing on server start
  • k8s-dqlite always listens on 19001, with no way to change it (from experience)
  • the snap-shipped dqlite client fails to connect to the broken cluster, and nothing is logged
  • documentation is very poor on recovery or on debugging a bad dqlite state
  • missing recovery commands (possibly a --single flag to start a local-node-only instance would be enough at first)
  • the “k8s” database name is sort of hardcoded? (how is one supposed to know it? It is not printed on server start, but the client requires it)
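
For what it is worth, the on-disk state dqlite keeps can at least be inspected by hand. A minimal sketch, assuming the stock microk8s backend path used throughout this issue (file names as described in the .reconfigure help further below):

cd /var/snap/microk8s/current/var/kubernetes/backend

# This node's own identity (ID, address, role):
cat info.yaml

# The membership this node will try to reach on startup:
cat cluster.yaml

# Raft snapshots and segments that the recovery steps below operate on:
ls -l snapshot-* open-* 0000* metadata* 2>/dev/null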

Node addresses referenced in the logs below:

  • 172.31.1.11 - physical node for recovery (no damage on this node)
  • 172.31.1.12 - lost node
  • 172.31.1.13 - lost node

First, the remaining single node fails to start:

root@cmp1-nuc11:~# /snap/microk8s/2948/bin/k8s-dqlite --debug=true --enable-tls --storage-dir=/var/snap/microk8s/2948/var/kubernetes/backend
INFO[0000] Starting dqlite                              
I0201 06:43:58.631095  427506 log.go:181] Failure domain set to 1
I0201 06:43:58.631109  427506 log.go:181] TLS enabled
FATA[0300] Failed to start server: context deadline exceeded 

^^ It fails a few minutes later, with no informative logs about the port, clients, or cluster state before the failure 😕 ^^ You can see it open the port, but it never starts listening for and accepting clients.
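
A quick way to confirm the "port open, not accepting" behaviour from a second shell (generic tooling, nothing microk8s-specific; a sketch only):

# Is anything bound to 19001, and which process owns it?
ss -tlnp | grep 19001

# Does a TCP handshake complete? With a listener that never accepts,
# this hangs or is refused rather than succeeding immediately.
nc -vz 172.31.1.11 19001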

With debugging enabled

LIBRAFT   1643698280403872586 src/uv.c:470 start index 41490458, 8616 entries
LIBRAFT   1643698280403878194 src/start.c:152 current_term:29 voted_for:15326470583229310013 start_index:41490458 n_entries:8616
LIBRAFT   1643698280403882557 src/start.c:157 restore snapshot with last index 41499052 and last term 29
LIBDQLITE 1643698280403885569 fsm__restore:461 fsm restore
LIBDQLITE 1643698280403896910 db__init:18 db init k8s
LIBDQLITE 1643698280404038284 VfsRestore:2673 vfs restore filename k8s size 71391256
LIBRAFT   1643698280428892866 src/configuration.c:342 configuration restore from snapshot
LIBRAFT   1643698280428900325 src/configuration.c:343 === CONFIG START ===
LIBRAFT   1643698280428901857 src/configuration.c:348 id:15326470583229310013 address:172.31.1.11:19001 role:1
LIBRAFT   1643698280428903189 src/configuration.c:348 id:7430328224810359299 address:172.31.1.12:19001 role:1
LIBRAFT   1643698280428904426 src/configuration.c:348 id:13363669305107245142 address:172.31.1.13:19001 role:1
LIBRAFT   1643698280428905264 src/configuration.c:350 === CONFIG END ===
LIBRAFT   1643698280428906668 src/start.c:181 restore 8616 entries starting at 41490458

And finally, as visible above, it fails to reach .12 and .13 (as they no longer exist):

LIBRAFT   1643698312458756223 src/uv_send.c:279 connect attempt completed -> status no connection to remote server available
LIBRAFT   1643698312458758318 src/uv_send.c:317 queue full -> evict oldest message
LIBDQLITE 1643698312458848979 connect_work_cb:79 connect failed to 13363669305107245142@172.31.1.13:19001
LIBDQLITE 1643698312458857306 connect_after_work_cb:138 connect after work cb status 0
LIBRAFT   1643698312458859612 src/uv_send.c:279 connect attempt completed -> status no connection to remote server available
LIBRAFT   1643698312458861483 src/uv_send.c:317 queue full -> evict oldest message

Attempted dqlite client connections (PKI paths given both in full and relative, -s with and without a tcp:// prefix, etc.); none of these gets anywhere, as a connection is never established. The microk8s recovery documentation should cover these commands.

root@cmp1-nuc11:/var/snap/microk8s/current/var/kubernetes/backend# /snap/microk8s/current/bin/dqlite -c cluster.crt -k cluster.key -s 172.31.1.11:19001  k8s -f json ".tables" 

root@cmp1-nuc11:/var/snap/microk8s/current/var/kubernetes/backend# /snap/microk8s/current/bin/dqlite -c cluster.crt -k cluster.key -s 172.31.1.11:19001  k8s -f json ".leader"

/snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s -f json ".remove 172.31.1.12:19001"

Have not tested yet:

  • access to the DB over the kine/etcd API (a rough sketch of how that might look follows below)
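
For completeness, a rough and untested sketch of what the kine/etcd route might look like. The endpoint location and the need for a separately installed etcdctl are assumptions, not something verified on this cluster:

# Find the etcd-compatible endpoint the apiserver is pointed at (served by kine/k8s-dqlite):
grep etcd-servers /var/snap/microk8s/current/args/kube-apiserver

# Then, with etcdctl installed separately, something along these lines
# (Kubernetes stores its objects under the /registry prefix):
ETCDCTL_API=3 etcdctl --endpoints="<endpoint printed above>" get /registry --prefix --keys-only | head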

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
MathieuBordere commented, Feb 1, 2022

Warning: what I describe here is a reconfigure procedure for a dqlite cluster. AFAIK this procedure has not been tested in the context of an existing microk8s installation, so I cannot guarantee it will play well in this case, as other variables are involved. Maybe @ktsakalozos should first validate that this procedure works in the case of microk8s.

dqlite has a way to load a new config into the raft cluster with the .reconfigure command; example usage below.

mathieu@linda:~ $ dqlite -s 127.0.0.1:9001 k8s .reconfigure
Error: bad command format, should be: .reconfigure <dir> <clusteryaml>
Args:
        dir - Directory of node with up to date data
        clusteryaml - Path to a .yaml file containing the desired cluster configuration

Help:
        Use this command when trying to preserve the data from your cluster while changing the
        configuration of the cluster because e.g. your cluster is broken due to unreachablee nodes.
        0. BACKUP ALL YOUR NODE DATA DIRECTORIES BEFORE PROCEEDING!
        1. Stop all dqlite nodes.
        2. Identify the dir of the node with the most up to date raft term and log, this will be the <dir> argument.
        3. Create a .yaml file with the same format as cluster.yaml (or use/adapt an existing cluster.yaml) with the
           desired cluster configuration. This will be the <clusteryaml> argument.
           Don't forget to make sure the ID's in the file line up with the ID's in the info.yaml files.
        4. Run the .reconfigure <dir> <clusteryaml> command, it should return "OK".
        5. Copy the snapshot-xxx-xxx-xxx, snapshot-xxx-xxx-xxx.meta, segment files (00000xxxxx-000000xxxxx), desired cluster.yaml
           from <dir> over to the directories of the other nodes identified in <clusteryaml>, deleting any leftover snapshot-xxx-xxx-xxx, snapshot-xxx-xxx-xxx.meta,
           segment (00000xxxxx-000000xxxxx, open-xxx) and metadata{1,2} files that it contains.
           Make sure an info.yaml is also present that is in line with cluster.yaml.
        6. Start all the dqlite nodes.
        7. If, for some reason, this fails or gives undesired results, try again with data from another node (you should still have this from step 0).

Skip step 5 if you only have a single node in your desired cluster configuration.
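
Applied to this issue, steps 0 and 1 might look like this on the surviving node (a sketch only; per the warning above it has not been validated on microk8s):

# Step 0: back up the whole backend directory before touching anything.
cp -a /var/snap/microk8s/current/var/kubernetes/backend /root/backend-backup-$(date +%F-%H%M)

# Step 1: stop all dqlite nodes; on a microk8s node that means stopping the snap services.
microk8s stop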

In your case you edit the cluster.yaml and remove the dead nodes from it. Based on the configuration in your logs, I expect your original cluster.yaml to look like this:

- Address: 172.31.1.11:19001
  ID: 15326470583229310013
  Role: 0
- Address: 172.31.1.12:19001
  ID: 7430328224810359299
  Role: 0
- Address: 172.31.1.13:19001
  ID: 13363669305107245142
  Role: 0

and then your new cluster.yaml would look like this:

- Address: 172.31.1.11:19001
  ID: 15326470583229310013
  Role: 0
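
Step 3 of the help text also says the IDs must line up with the info.yaml files. A quick sanity check on the surviving node (assuming info.yaml uses the same Address/ID/Role keys as cluster.yaml, which is how go-dqlite writes it):

cat /var/snap/microk8s/current/var/kubernetes/backend/info.yaml
# Should contain only this node's own entry, matching the surviving node above:
# Address: 172.31.1.11:19001
# ID: 15326470583229310013
# Role: 0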

After backing up all your data on your nodes and stopping the nodes, you can run the command on the alive node.

/snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s ".reconfigure /var/snap/microk8s/current/var/kubernetes/backend/ /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml"

The address after -s doesn’t really matter for this command; it’s the data directory and cluster.yaml that are important.
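
Once the reconfigure returns OK and the node is started again, the client commands from the issue double as a verification step; for example (same certificate and data paths as above):

microk8s start

# The single remaining member should now report itself as leader:
/snap/microk8s/current/bin/dqlite -s 172.31.1.11:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s -f json ".leader"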

1 reaction
neoaggelos commented, Jun 28, 2022

Hi @tinklern. Thank you for bringing this to our attention; the documentation page has been updated and the change will be reflected shortly.
