dqlite recovery: missing documentation and robustness
I ran a cluster (1.23/stable) in HA with 3 nodes, and 2 were lost. I tried a recovery from the remaining node and found that things are fishy.
The ultimate question: is there any way to connect to such a cluster database and recover it, without editing code and making a custom build?
Subject of the issue:
- no means to get the real state, remove failed nodes from the cluster, and start it from a single instance
- --debug is not helpful on k8s-dqlite; critical information is missing on server start
- k8s-dqlite always listens on 19001, with no way to change it (from experience)
- the snap-shipped dqlite client fails to connect to the broken cluster and logs nothing; documentation is very poor on recovery or debugging a bad dqlite state
- missing recovery commands (a --single flag to start a local single-node instance would be enough at first)
- the "k8s" database name seems to be hardcoded? (how is one supposed to know it? It is not printed on server start, but the client requires it)
For the logs below:
- 172.31.1.11 - physical node for recovery (no damage on this node)
- 172.31.1.12 - lost node
- 172.31.1.13 - lost node
First, the single-node cluster fails to start:
root@cmp1-nuc11:~# /snap/microk8s/2948/bin/k8s-dqlite --debug=true --enable-tls --storage-dir=/var/snap/microk8s/2948/var/kubernetes/backend
INFO[0000] Starting dqlite
I0201 06:43:58.631095 427506 log.go:181] Failure domain set to 1
I0201 06:43:58.631109 427506 log.go:181] TLS enabled
FATA[0300] Failed to start server: context deadline exceeded
It fails a few minutes later, with no informative logs about the port, clients, or cluster state before the failure. The server opens the port but never starts listening and accepting clients.
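The observation above (port open, but no clients accepted) can be checked from the shell. A minimal sketch, assuming the default host and port from this report; the `probe` helper is my own, using bash's /dev/tcp so no extra tools are needed:

```shell
# Probe whether a TCP port actually accepts connections.
# Prints "open" if a connection succeeds within 2 seconds, "closed" otherwise.
probe() {
    host=$1; port=$2
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}

# The dqlite port from this report:
probe 127.0.0.1 19001
```

If this prints "closed" while the k8s-dqlite process is running, the server never reached the listening state, matching the behavior described here.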
With debugging enabled
LIBRAFT 1643698280403872586 src/uv.c:470 start index 41490458, 8616 entries
LIBRAFT 1643698280403878194 src/start.c:152 current_term:29 voted_for:15326470583229310013 start_index:41490458 n_entries:8616
LIBRAFT 1643698280403882557 src/start.c:157 restore snapshot with last index 41499052 and last term 29
LIBDQLITE 1643698280403885569 fsm__restore:461 fsm restore
LIBDQLITE 1643698280403896910 db__init:18 db init k8s
LIBDQLITE 1643698280404038284 VfsRestore:2673 vfs restore filename k8s size 71391256
LIBRAFT 1643698280428892866 src/configuration.c:342 configuration restore from snapshot
LIBRAFT 1643698280428900325 src/configuration.c:343 === CONFIG START ===
LIBRAFT 1643698280428901857 src/configuration.c:348 id:15326470583229310013 address:172.31.1.11:19001 role:1
LIBRAFT 1643698280428903189 src/configuration.c:348 id:7430328224810359299 address:172.31.1.12:19001 role:1
LIBRAFT 1643698280428904426 src/configuration.c:348 id:13363669305107245142 address:172.31.1.13:19001 role:1
LIBRAFT 1643698280428905264 src/configuration.c:350 === CONFIG END ===
LIBRAFT 1643698280428906668 src/start.c:181 restore 8616 entries starting at 41490458
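Since the documentation does not cover how to find the cluster membership, the member list can be scraped out of a debug dump like the one above. A sketch; the `extract_members` helper is my own, and the sample input is taken verbatim from the LIBRAFT output in this report:

```shell
# Extract id/address/role triples from the "=== CONFIG ===" block
# of a k8s-dqlite debug dump read on stdin.
extract_members() {
    awk '/=== CONFIG START ===/ {in_cfg = 1; next}
         /=== CONFIG END ===/   {in_cfg = 0}
         in_cfg {
             out = ""
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^(id|address|role):/) out = out $i " "
             print out
         }'
}

extract_members <<'EOF'
LIBRAFT 1643698280428900325 src/configuration.c:343 === CONFIG START ===
LIBRAFT 1643698280428901857 src/configuration.c:348 id:15326470583229310013 address:172.31.1.11:19001 role:1
LIBRAFT 1643698280428903189 src/configuration.c:348 id:7430328224810359299 address:172.31.1.12:19001 role:1
LIBRAFT 1643698280428904426 src/configuration.c:348 id:13363669305107245142 address:172.31.1.13:19001 role:1
LIBRAFT 1643698280428905264 src/configuration.c:350 === CONFIG END ===
EOF
```

This makes it easy to see which members are dead (.12 and .13 here) before attempting any removal.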
And finally, as visible below, it fails to reach .12 and .13 (as they no longer exist):
LIBRAFT 1643698312458756223 src/uv_send.c:279 connect attempt completed -> status no connection to remote server available
LIBRAFT 1643698312458758318 src/uv_send.c:317 queue full -> evict oldest message
LIBDQLITE 1643698312458848979 connect_work_cb:79 connect failed to 13363669305107245142@172.31.1.13:19001
LIBDQLITE 1643698312458857306 connect_after_work_cb:138 connect after work cb status 0
LIBRAFT 1643698312458859612 src/uv_send.c:279 connect attempt completed -> status no connection to remote server available
LIBRAFT 1643698312458861483 src/uv_send.c:317 queue full -> evict oldest message
I tried several ways of connecting with the dqlite client (PKI paths given in full and partial form, a tcp:// prefix after -s, etc.); none of them worked, as the client never connected. The microk8s recovery documentation should cover these commands.
root@cmp1-nuc11:/var/snap/microk8s/current/var/kubernetes/backend# /snap/microk8s/current/bin/dqlite -c cluster.crt -k cluster.key -s 172.31.1.11:19001 k8s -f json ".tables"
root@cmp1-nuc11:/var/snap/microk8s/current/var/kubernetes/backend# /snap/microk8s/current/bin/dqlite -c cluster.crt -k cluster.key -s 172.31.1.11:19001 k8s -f json ".leader"
/snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s -f json ".remove 172.31.1.12:19001"
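The repeated TLS flags in these invocations can be wrapped in a small helper when experimenting with recovery. A sketch assuming the snap paths used above; the `dq` function name is my own:

```shell
# Wrap the dqlite client invocation used throughout this report.
# Paths are the microk8s snap defaults; override via environment if needed.
BACKEND=${BACKEND:-/var/snap/microk8s/current/var/kubernetes/backend}
DQLITE=${DQLITE:-/snap/microk8s/current/bin/dqlite}

dq() {
    # Usage: dq '<command>'   e.g.  dq ".leader"
    "$DQLITE" -s 127.0.0.1:19001 \
        -c "$BACKEND/cluster.crt" -k "$BACKEND/cluster.key" \
        k8s -f json "$1"
}
```

With this in place, the attempts above become `dq ".tables"`, `dq ".leader"`, and `dq ".remove 172.31.1.12:19001"`.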
Not yet tested:
- access to db over kine/etcd api
Issue Analytics
- Created: 2 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Warning: what I describe here is a reconfigure procedure for a dqlite cluster. AFAIK this procedure has not been tested in the context of an existing microk8s installation, so I cannot guarantee that it will play well in this case, as other variables are involved. Maybe @ktsakalozos should first validate whether this procedure works in the case of microk8s.
dqlite has a way to load a new config into the raft cluster with the .reconfigure command; example usage here. Skip step 5 if you only have a single node in your desired cluster configuration.
In your case, you edit cluster.yaml and remove the dead nodes from it, so that the new cluster.yaml lists only the surviving node. After backing up all the data on your nodes and stopping the nodes, you can run the following command on the alive node:
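For illustration, a sketch of what the original cluster.yaml might contain, reconstructed from the node IDs and addresses in the configuration dump logged above. The field names and the Role encoding here are assumptions about dqlite's on-disk format, not verified against this installation:

```yaml
# Hypothetical original cluster.yaml, reconstructed from the logged config.
# Field names and Role values are assumed, not verified.
- Address: 172.31.1.11:19001
  ID: 15326470583229310013
  Role: 0
- Address: 172.31.1.12:19001
  ID: 7430328224810359299
  Role: 0
- Address: 172.31.1.13:19001
  ID: 13363669305107245142
  Role: 0
```

The new cluster.yaml would then keep only the 172.31.1.11 entry, since .12 and .13 are the lost nodes.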
/snap/microk8s/current/bin/dqlite -s 127.0.0.1:19001 -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key k8s ".reconfigure /var/snap/microk8s/current/var/kubernetes/backend/ /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml"
the address after -s doesn’t really matter for this command; it’s the data directory and the cluster.yaml that are important.

Hi @tinklern. Thank you for bringing this to our attention. The documentation page has been updated and the change will be reflected shortly.