
Etcd2 error check 2 failing after adding two worker nodes


I tried adding two worker nodes today, using Terraform to build them (no issues there), then using the --limit option with Ansible to limit the installer script to only those two new nodes.

ansible-playbook -e @security.yml sample.yml --limit "mantl-do-nyc2-worker-006,mantl-do-nyc2-worker-007"

The install went fine with no errors, and everything came online except the 'etcd' service check service:etcd:2, which is failing.
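
For anyone else digging into this later, a quick way to see the raw output of that failing check is to ask the local Consul agent over its HTTP API (a hedged sketch; assumes the default API port 8500):

# Fetch the health checks registered for the etcd service; the failing
# check's Output field shows why service:etcd:2 is critical.
curl -s http://localhost:8500/v1/health/checks/etcd | python -m json.tool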

When I look at the worker node log, I see this:

Jun 17 20:15:10 mantl-do-nyc2-worker-006 etcd-service-start.sh: 2016/06/17 20:15:10 rafthttp: request sent was ignored (cluster ID mismatch: remote[a43f56d4501b2085]=590023a30e52fddb, local=1a8a3d8c3c391b4d)

When I look at a control node, I see this (scrolling continuously):

Jun 17 20:18:27 mantl-do-nyc2-mcontrol-03 etcd-service-start.sh: 2016/06/17 20:18:27 rafthttp: streaming request ignored (cluster ID mismatch got 1a8a3d8c3c391b4d want 590023a30e52fddb)

I did notice something else: on the control node, the /etc/hosts file did not include the new nodes that were added. I imagine something needs to be done about that too, and I wonder whether it has to do with the --limit option being used.
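
A quick way to confirm that on a control node (a sketch; the hostnames are the two new workers from the --limit run above):

# Check whether the new workers ever made it into /etc/hosts on this node.
grep -E 'mantl-do-nyc2-worker-00[67]' /etc/hosts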

All other health checks are good.

  • Ansible version (ansible --version): 1.9.6
  • Python version (python --version): 2.7.5
  • Git commit hash or branch: master pulled on 5/31/16 with PR #1463 merged (for SSL)
  • Cloud Environment: Digital Ocean
  • Terraform version (terraform version): v0.6.14

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
distributorofpain commented on Jun 27, 2016

@crumley has provided this playbook to reset the etcd cluster. It wipes the cluster out, but it does restore all nodes to working order.

- hosts: role=worker:role=control:role=kubeworker:role=edge
  tasks:
    - name: Stop etcd
      sudo: yes
      service:
        name: etcd
        state: stopped
    - name: Delete etcd data directory
      sudo: yes
      shell: rm -rf /var/lib/etcd/*
    - name: Start etcd
      sudo: yes
      service:
        name: etcd
        state: started
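
A minimal invocation sketch, assuming the playbook above is saved as reset-etcd.yml (a placeholder filename) and run with the same security settings as the main install:

# WARNING: this wipes all etcd data on every node before restarting the cluster.
# reset-etcd.yml stands in for wherever you saved the playbook above.
ansible-playbook -e @security.yml reset-etcd.yml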

0 reactions
tphummel commented on Aug 26, 2016

  • OS X: 10.11.6
  • Ansible: 2.1.1.0
  • Python: 2.7.10
  • Terraform: 0.7.0
  • Mantl: master 3921f15e682e04f45a53f3d51f864109201c7ef6 (with a few small tweaks)
  • Provider: AWS

I'm not convinced this is a duplicate of #1372. @viviandarkbloom and I added two new Mantl worker nodes to our existing Mantl installation in AWS. We ran into the same problems adding new etcd members to the existing cluster as @distributorofpain mentions above. The difference in solution, though, is that we managed to fix the existing cluster rather than recreating it from scratch. I'm going to outline what we tried, what worked, and what didn't.

  1. We increased the worker_count in Terraform from 4 to 6, planned, and applied. We saw this issue with EBS volume attachments and count updates: https://github.com/hashicorp/terraform/issues/5240. We pushed ahead and ran into problems with the EBS volumes getting stuck in an attaching state. We were able to finesse our way out of it with some forced detaches of volumes and instance reboots via the EC2 web console.
  2. We did the main Ansible run without targeting any subset of the Mantl hosts, roughly: ansible-playbook -e @security.yml mantl.yml.
  3. The Ansible run hit a strange situation where one of the Vault servers on a control node got sealed. I reran the unseal script on that one host and restarted the ansible-playbook run, which then finished. This isn't directly related to the etcd problem AFAIK.
  4. Following ansible-playbook completion, we saw several of the distributive checks in Consul failing. The etcd service is failing its second health check on worker-001 and worker-002. Looking at the etcd logs on any host, we see lots of repetitions of lines like these (a quick way to confirm the mismatch follows the excerpt):
Aug 24 21:02:01 demo-edge-01 etcd: streaming request ignored (cluster ID mismatch got 49e9380083da3b4b want 93724bfe0b791e78)
Aug 24 21:02:01 demo-edge-01 etcd: streaming request ignored (cluster ID mismatch got 49e9380083da3b4b want 93724bfe0b791e78)
Aug 24 21:02:01 demo-edge-01 etcd: streaming request ignored (cluster ID mismatch got 49e9380083da3b4b want 93724bfe0b791e78)
Aug 24 21:02:02 demo-edge-01 etcd: streaming request ignored (cluster ID mismatch got 49e9380083da3b4b want 93724bfe0b791e78)
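
A quick way to confirm the mismatch yourself (a sketch; assumes etcd2's v2 API on the default client port 2379) is to compare the cluster ID each node reports, which the v2 API returns as a response header:

# Run this on an old control node and on a new worker and compare the values;
# a difference here corresponds to the "cluster ID mismatch" log lines above.
curl -s -D - -o /dev/null http://localhost:2379/v2/members | grep -i x-etcd-cluster-id
etcdctl cluster-health   # also shows which members the local node can reach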

From here, we first elected the nuclear option re: etcd. We tried the Ansible playbook above in this thread: https://github.com/CiscoCloud/mantl/issues/1566#issuecomment-228865979, without success. Then we saw that a similar version of that playbook had arrived in Mantl here: https://github.com/CiscoCloud/mantl/blob/master/playbooks/recreate-etcd-cluster.yml. We tried that too, also without success. Both playbooks ran to completion but failed to get the entire etcd cluster back into a healthy state; cluster state was somehow surviving those playbook runs.

We noticed this thread a little late: https://github.com/CiscoCloud/mantl/issues/1216. I can't comment on the effectiveness of that approach to adding a new worker.

More googling revealed this etcd thread, which mentions the same errors we were seeing: https://github.com/coreos/etcd/issues/3710. It recommends following the instructions in this doc for updating the cluster membership: https://coreos.com/etcd/docs/latest/runtime-configuration.html#cluster-reconfiguration-operations. That is good advice and ultimately what led us to the solution that follows (condensed into a command sketch after the list):

  • stop the etcd process on each failing host: systemctl stop etcd.service
  • you should begin seeing failure to connect messages in other cluster members’ etcd logs re: the hosts we stopped.
  • those failure messages will mention the member id of the services we stopped
  • remove each stopped member from the cluster on a separate member host which is still connected: etcdctl member remove <id>
  • delete the etcd data directory from each stopped member: rm -rf /var/lib/etcd/*
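
Condensed into commands (a sketch; <id> stands for the member ID reported in those failure messages or by etcdctl member list):

# On each failing member:
systemctl stop etcd.service
# On a member that is still healthy and connected:
etcdctl member list        # note the ID of each stopped member
etcdctl member remove <id>
# Back on each stopped member, clear its stale data:
rm -rf /var/lib/etcd/*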

I did the following on one host at a time, though I'm guessing you might be able to do all hosts in parallel.

Re-add the removed host to the cluster with etcdctl member add infra3 http://10.0.1.13:2380, which prints out some helpful environment settings like:

ETCD_NAME="infra3"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
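
An alternative way to run that step (purely illustrative) is to capture the printed variables into a scratch file so they can be sourced before starting etcd:

# /tmp/etcd-join.env is just a throwaway file for the ETCD_* lines printed above.
etcdctl member add infra3 http://10.0.1.13:2380 | grep '^ETCD_' > /tmp/etcd-join.env
set -o allexport; source /tmp/etcd-join.env; set +o allexport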

Then, on the member we're attempting to re-add to the cluster, I started etcd in the foreground using a combination of its Mantl-installed environment variables and the ones above:

set -o allexport
source /etc/etcd/etcd.conf
set +o allexport

Set the initial cluster. It turned out to be very important that this host list exactly match the list of active etcd members plus the one being added. I think if we had done etcdctl member add for all of the new hosts at once, we could have used the ETCD_INITIAL_CLUSTER rendered by Mantl in /etc/etcd/etcd.conf. In our case we needed an intermediate host list omitting the hosts that hadn't been added yet.

export ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380"

Then I lifted the etcd startup command out of /usr/lib/systemd/system/etcd.service and started etcd in the foreground (as the root user):

ETCD_INITIAL_CLUSTER_STATE="existing" GOMAXPROCS=$(nproc) /usr/bin/etcd --name="${ETCD_NAME}" --data-dir="${ETCD_DATA_DIR}" --listen-client-urls="${ETCD_LISTEN_CLIENT_URLS}"

The logs should report a successful member join. We then stopped the process with Ctrl+C. Before we could get the service running as the etcd user again (as systemd does), we had to chown -R etcd:etcd /var/lib/etcd/*. Then systemctl start etcd.service, and the host should be running via systemd again, restored to healthy membership status.
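
Roughly, that hand-back looks like this (a sketch of the steps just described):

# After stopping the foreground etcd with Ctrl+C, return ownership of the data
# to the etcd user and let systemd manage the service again.
chown -R etcd:etcd /var/lib/etcd/*
systemctl start etcd.service
systemctl status etcd.service   # the member should now report as active and healthy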

The reason we did this roundabout rejoin process is that the logic in the template rendered at /usr/local/bin/etcd-service-start.sh didn't appear to work correctly for new hosts attempting to join an existing cluster. Because it is short-ish, I'll paste it here:

#! /bin/sh
export GOMAXPROCS=$(nproc)
if test -d /var/lib/etcd/member; then
  ETCD_INITIAL_CLUSTER_STATE=existing
  unset ETCD_INITIAL_ADVERTISE_PEER_URLS
  unset ETCD_INITIAL_CLUSTER
else
  ETCD_INITIAL_CLUSTER_STATE=new
fi
export ETCD_INITIAL_CLUSTER_STATE
exec /usr/bin/etcd "$@"

In our example, we need ETCD_INITIAL_CLUSTER_STATE=existing even though the /var/lib/etcd/member directory doesn't exist yet; otherwise etcd will attempt to bootstrap a new cluster rather than joining the existing one.
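
For comparison, one way the script could support this case (a sketch, not what Mantl ships) is to apply the directory heuristic only when ETCD_INITIAL_CLUSTER_STATE hasn't already been provided, so a new host can be told explicitly to join an existing cluster:

#! /bin/sh
# Hypothetical variant of etcd-service-start.sh: respect an explicit
# ETCD_INITIAL_CLUSTER_STATE from the environment before falling back
# to the data-directory check.
export GOMAXPROCS=$(nproc)
if [ -z "${ETCD_INITIAL_CLUSTER_STATE}" ]; then
  if test -d /var/lib/etcd/member; then
    ETCD_INITIAL_CLUSTER_STATE=existing
    unset ETCD_INITIAL_ADVERTISE_PEER_URLS
    unset ETCD_INITIAL_CLUSTER
  else
    ETCD_INITIAL_CLUSTER_STATE=new
  fi
fi
export ETCD_INITIAL_CLUSTER_STATE
exec /usr/bin/etcd "$@"

With that change, setting ETCD_INITIAL_CLUSTER_STATE=existing (plus the ETCD_INITIAL_CLUSTER value printed by etcdctl member add) in the environment would be enough to join without the foreground workaround.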

In the end, both of the unhealthy members became healthy and all of the distributive and etcd service checks returned to a passing state. I'm hopeful this will be of use for improving the process of adding new worker hosts to an existing Mantl installation.
