Etcd2 error check 2 failing after adding two worker nodes
I tried adding two worker nodes today, using terraform to build them (no issues there), then using the --limit option with ansible to limit the installer script to only those two new nodes.
ansible-playbook -e @security.yml sample.yml --limit "mantl-do-nyc2-worker-006,mantl-do-nyc2-worker-007"
The install went fine, no errors, and everything came online except the 'etcd' service check service:etcd:2, which is failing.
When I look at the worker node log, I see this:
Jun 17 20:15:10 mantl-do-nyc2-worker-006 etcd-service-start.sh: 2016/06/17 20:15:10 rafthttp: request sent was ignored (cluster ID mismatch: remote[a43f56d4501b2085]=590023a30e52fddb, local=1a8a3d8c3c391b4d)
When I look at a control node, I see this (scrolling repeatedly):
Jun 17 20:18:27 mantl-do-nyc2-mcontrol-03 etcd-service-start.sh: 2016/06/17 20:18:27 rafthttp: streaming request ignored (cluster ID mismatch got 1a8a3d8c3c391b4d want 590023a30e52fddb)
I did notice something else: on the control node, the /etc/hosts file did not include the new nodes that were added. I imagine something needs to be done about that too. Wondering if that has to do with the --limit option being used.
All other health checks are good.
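For anyone hitting the same thing, a quick way to confirm that the new workers bootstrapped their own etcd cluster (which is what the cluster ID mismatch messages suggest) is to compare member lists from a control node and from one of the new workers. This is only a sketch: it assumes the etcd 2.x `etcdctl` that mantl installs, and depending on how security.yml rendered the client config you may also need the TLS flags (`--ca-file`, `--cert-file`, `--key-file`) or an explicit `--endpoint`.

```sh
# On a control node (part of the original cluster):
etcdctl cluster-health
etcdctl member list

# On one of the new workers: if this lists only the new worker(s) as members,
# the node bootstrapped a separate cluster, which matches the
# "cluster ID mismatch" messages in the logs.
etcdctl member list
```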
- Ansible version (`ansible --version`): 1.9.6
- Python version (`python --version`): 2.7.5
- Git commit hash or branch: master pulled on 5/31/16 with PR #1463 merged (for SSL)
- Cloud Environment: Digital Ocean
- Terraform version (`terraform version`): v0.6.14
@crumley has provided this playbook to reset the etcd cluster. It wipes the cluster out, but it does restore all nodes to working order.
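The playbook itself isn't reproduced here. As a rough shell equivalent of what a full reset amounts to (based on the steps described later in this thread, and assuming the stock mantl paths and service names), it boils down to the following, run on every etcd member:

```sh
# DESTRUCTIVE: this wipes all etcd data on the host it runs on.
# Run on every etcd member (control nodes and workers).
systemctl stop etcd.service

# Forget the old cluster state (including the mismatched cluster ID).
rm -rf /var/lib/etcd/*

# With no member directory left, the mantl start script bootstraps a fresh
# cluster from the rendered ETCD_INITIAL_CLUSTER on the next start.
systemctl start etcd.service
```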
I’m not convinced this is a duplicate of #1372. @viviandarkbloom and I added two new mantl worker nodes to our existing mantl installation in AWS. we ran into the same problems adding new etcd members to the existing cluster as @distributorofpain mentions above. the difference in solution though is we managed to fix the existing cluster rather than recreating it from scratch. I’m going to outline what we tried, what worked, and what didn’t.
- bumped `worker_count` in terraform from 4 to 6, then `plan`ed and `apply`ed. we saw this issue with ebs volume attachments and `count` updates: https://github.com/hashicorp/terraform/issues/5240. we pushed ahead and ran into problems with the ebs volumes getting stuck in an `attaching` state. we were able to finesse our way out of it with some forced detaches of volumes and instance reboots via the ec2 web console.
- ran `ansible-playbook -e @security.yml mantl.yml`. the `vault` servers on a control node got sealed; i reran the unseal script on that one host and restarted the `ansible-playbook`, which finished. this isn't directly related to the etcd problem AFAIK.
- after the run, the etcd checks were failing on `worker-001` and `worker-002`. looking at etcd logs on any host we see lots of repetitions of the cluster ID mismatch lines quoted above.

from here, we first elected the nuclear option re: etcd. we tried the ansible playbook posted above in this thread (https://github.com/CiscoCloud/mantl/issues/1566#issuecomment-228865979), without success. then we saw a similar version of that playbook had arrived in mantl here: https://github.com/CiscoCloud/mantl/blob/master/playbooks/recreate-etcd-cluster.yml. we tried that as well, also without success. both playbooks ran to completion but failed to get the entire etcd cluster back into a healthy state; cluster state was somehow surviving those playbook runs.
we noticed this thread a little late: https://github.com/CiscoCloud/mantl/issues/1216. i can’t comment on the effectiveness of this approach to adding a new worker.
more googling revealed this etcd thread, which mentions the same errors we were seeing: https://github.com/coreos/etcd/issues/3710. it recommends following the instructions in this doc for updating the cluster membership: https://coreos.com/etcd/docs/latest/runtime-configuration.html#cluster-reconfiguration-operations. this is good advice and ultimately what led us to our solution which follows:
- on each unhealthy host, stop etcd: `systemctl stop etcd.service`. this produces `failure to connect` messages in other cluster members' etcd logs re: the hosts we stopped.
- from a healthy member, remove each stopped host from the cluster: `etcdctl member remove <id>`.
- on each stopped host, wipe the etcd data directory: `rm -rf /var/lib/etcd/*`.

i did the following on one host at a time, though I'm guessing you might be able to do all hosts in parallel.

re-add the removed host to the cluster with `etcdctl member add infra3 http://10.0.1.13:2380`, which prints out some helpful environment settings.
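(The pasted output block is missing here; roughly, etcd 2.x prints something like the following, where the member ID and the other members' URLs are illustrative rather than taken from this cluster:)

```sh
$ etcdctl member add infra3 http://10.0.1.13:2380
added member 9bf1b35fc7761a23 to cluster

ETCD_NAME="infra3"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```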
then, on the member we're attempting to re-add to the cluster, i started etcd in the foreground using a combination of its mantl-installed environment variables and the ones above:
set the initial cluster: it turned out to be very important that this host list match exactly the list of active etcd members plus the one we're adding. i think if we had done `etcdctl member add` on all of the new hosts at once, we could have used the `ETCD_INITIAL_CLUSTER` rendered by mantl in `/etc/etcd/etcd.conf`. in our case we needed an intermediate host list omitting the hosts which hadn't been added yet.

then i lifted the etcd startup command out of `/usr/lib/systemd/system/etcd.service` and started etcd in the foreground (as the `root` user):
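(The exact command is missing from this copy; the sketch below shows the general shape of it. It assumes the mantl-rendered `/etc/etcd/etcd.conf` is a plain KEY=value environment file and that the unit's ExecStart is plain `/usr/bin/etcd`; the member name and URLs are placeholders from the example above.)

```sh
# Run as root on the member being re-added.
# Pull in the mantl-rendered etcd settings first...
set -a
. /etc/etcd/etcd.conf
set +a

# ...then override the join-specific values printed by `etcdctl member add`.
# The initial cluster list must be exactly the current members plus this host.
export ETCD_NAME="infra3"
export ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
export ETCD_INITIAL_CLUSTER_STATE="existing"

# Start etcd in the foreground with the ExecStart lifted from etcd.service
# (assumed here to be /usr/bin/etcd).
/usr/bin/etcd
```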
the logs should report a successful member join. then we stopped the process with `CTRL+c`. before we could get the service running as the `etcd` user again (as systemd does), we had to `chown -R etcd:etcd /var/lib/etcd/*`. then `systemctl start etcd.service`, and the host should be running via systemd and restored to healthy membership status.

the reason we did this roundabout rejoin process is that the logic in the template rendered at `/usr/local/bin/etcd-service-start.sh` didn't appear to work correctly for new hosts attempting to join an existing cluster. because it is short-ish i'll paste it here:
in our example, we need `ETCD_INITIAL_CLUSTER_STATE=existing` even though the `/var/lib/etcd/member` directory doesn't exist; otherwise etcd will attempt to bootstrap a new cluster rather than joining the existing one.

in the end, both of the unhealthy members became healthy and all of the distributive and etcd service checks returned to a passing state. i'm hopeful this will be of use for improving the process for adding new worker hosts to an existing mantl installation.