
K8s HA installation timed out on task "Join master to ControlPlane"

See original GitHub issue

Describe the bug
K8s HA installation fails randomly on the task "Join master to ControlPlane" in Azure environments.

To Reproduce
Steps to reproduce the behavior:

  1. Execute epicli apply -f test.yml

Expected behavior
The HA cluster is deployed successfully.

Config files
Configuration that should be included in the YAML file:

specification:
  components:
    kubernetes_master:
      count: 3
    kubernetes_node:
      count: 3
---
kind: configuration/shared-config
title: Shared configuration that will be visible to all roles
name: default
specification:
  use_ha_control_plane: true
provider: azure

Task where the problem appears:

- when: not kubernetes_common.master_already_joined
  block:
    - include_role:
        name: kubernetes_common
        tasks_from: ensure-token

    - block:
        - name: Ensure /etc/kubeadm/ directory
          file:
            path: /etc/kubeadm/
            state: directory
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Render /etc/kubeadm/kubeadm-join-master.yml template
          template:
            src: kubeadm-join-master.yml.j2
            dest: /etc/kubeadm/kubeadm-join-master.yml
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Join master to ControlPlane
          shell: |
            kubeadm join \
              --config /etc/kubeadm/kubeadm-join-master.yml
          args:
            executable: /bin/bash

        - name: Mark master as joined
          set_fact:
            kubernetes_common: >-
              {{ kubernetes_common | default({}) | combine(set_fact, recursive=true) }}
          vars:
            set_fact:
              master_already_joined: true

- name: Include kubelet configuration tasks
  include_role:
    name: kubernetes_common
    tasks_from: configure-kubelet
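
The issue does not include the kubeadm-join-master.yml.j2 template itself, but the file it renders is a kubeadm JoinConfiguration, and the controlPlane section is what makes this a control-plane join rather than a worker join. A rough sketch of its likely shape follows; the API version matches the kubeadm 1.17/1.18 era visible in the log, and the endpoint, token, and hash are placeholders, not values from the repository:

# Sketch only: hypothetical rendered /etc/kubeadm/kubeadm-join-master.yml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "<load-balancer-address>:6443"  # placeholder
    token: "<bootstrap-token>"                         # placeholder
    caCertHashes:
      - "sha256:<ca-cert-hash>"                        # placeholder
controlPlane:                # this section requests a control-plane join
  localAPIEndpoint:
    advertiseAddress: "<node-ip>"                      # placeholder
    bindPort: 6443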

OS (please complete the following information):

  • OS: [RHEL, Ubuntu]

Cloud Environment (please complete the following information):

  • Cloud Provider: [MS Azure]

Additional context
Log:

2020-07-27T13:02:47.6918829Z 13:02:47 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Join master to ControlPlane] *************************
2020-07-27T13:03:56.4662918Z 13:03:56 INFO cli.engine.ansible.AnsibleCommand - fatal: [ci-06hatodevazrhcanal-kubernetes-master-vm-2]: FAILED! =>
  changed: true
  cmd: kubeadm join --config /etc/kubeadm/kubeadm-join-master.yml
  msg: non-zero return code (rc: 1)
  start: 2020-07-27 13:02:48.341985  end: 2020-07-27 13:03:56.339262  delta: 0:01:07.997277

  stderr:
    W0727 13:02:48.379669   13265 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
        [WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
    W0727 13:03:20.332544   13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
    W0727 13:03:20.339844   13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
    W0727 13:03:20.340659   13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
    {"level":"warn","ts":"2020-07-27T13:03:44.441Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://10.1.1.9:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
    error execution phase control-plane-join/update-status: error uploading configuration: etcdserver: leader changed
    To see the stack trace of this error execute with --v=5 or higher

  stdout:
    [preflight] Running pre-flight checks
    [preflight] Reading configuration from the cluster...
    [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
    [preflight] Running pre-flight checks before initializing the new control plane instance
    [preflight] Pulling images required for setting up a Kubernetes cluster
    [preflight] This might take a minute or two, depending on the speed of your internet connection
    [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
    [certs] Using certificateDir folder "/etc/kubernetes/pki"
    [certs] Generating "front-proxy-client" certificate and key
    [certs] Generating "etcd/server" certificate and key
    [certs] etcd/server serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]
    [certs] Generating "etcd/peer" certificate and key
    [certs] etcd/peer serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]
    [certs] Generating "etcd/healthcheck-client" certificate and key
    [certs] Generating "apiserver-etcd-client" certificate and key
    [certs] Generating "apiserver" certificate and key
    [certs] apiserver serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.1.1.9]
    [certs] Generating "apiserver-kubelet-client" certificate and key
    [certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
    [certs] Using the existing "sa" key
    [kubeconfig] Generating kubeconfig files
    [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
    [endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
    [kubeconfig] Writing "admin.conf" kubeconfig file
    [kubeconfig] Writing "controller-manager.conf" kubeconfig file
    [kubeconfig] Writing "scheduler.conf" kubeconfig file
    [control-plane] Using manifest folder "/etc/kubernetes/manifests"
    [control-plane] Creating static Pod manifest for "kube-apiserver"
    [control-plane] Creating static Pod manifest for "kube-controller-manager"
    [control-plane] Creating static Pod manifest for "kube-scheduler"
    [check-etcd] Checking that the etcd cluster is healthy
    [kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
    [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
    [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
    [kubelet-start] Starting the kubelet
    [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
    [etcd] Announced new etcd member joining to the existing etcd cluster
    [etcd] Creating static Pod manifest for "etcd"
    [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
    [upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace

It happens randomly; on average, one out of every two HA deployments on Azure fails because of this issue.
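
The failure occurs in kubeadm's control-plane-join/update-status phase, which writes the cluster configuration back to the kubeadm-config ConfigMap. Adding a new member to the stacked etcd cluster can trigger a leader election, and a write that lands mid-election fails with "etcdserver: leader changed". Because the error is transient, one possible mitigation is to retry the join. The task below is only a sketch (not necessarily the fix the maintainers shipped in the linked PR); it resets the node between attempts so a retry can pass kubeadm's preflight checks, and the retry counts are illustrative:

- name: Join master to ControlPlane
  shell: |
    # Sketch: on failure, clean up the partial join so the next attempt
    # starts from a known state.
    kubeadm join --config /etc/kubeadm/kubeadm-join-master.yml \
      || { kubeadm reset --force; exit 1; }
  args:
    executable: /bin/bash
  register: join_result
  until: join_result.rc == 0
  retries: 3
  delay: 30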

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:7 (6 by maintainers)

Top GitHub Comments

2 reactions
przemyslavic commented, Jul 31, 2020

This has now been tested with the latest develop code and Kubernetes 1.18.6, and I haven't been able to reproduce the problem yet. More testing is still underway. The problem was reproduced many times in version 0.6.
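
For anyone reproducing this, one way to confirm the race is to watch etcd leadership from an existing master while the new one joins; a leader change around the update-status phase matches the error above. The etcdctl invocation below is illustrative, not taken from the issue; it assumes the default kubeadm certificate paths, which match the certs output in the log:

sudo ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key /etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status -w table    # the IS LEADER column shows the current leader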

0 reactions
mkyc commented, Apr 8, 2021

Handled in this PR


Top Results From Across the Web

  • Creating Highly Available Clusters with kubeadm | Kubernetes
    If a timeout occurs, reconfigure the load balancer to communicate with the control plane node. Add the remaining control plane nodes to the...
  • Marking the master by adding the taints 'error ... - GitHub
    I have fixed this problem by disabling etcd TLS. cat kubeadm-config.yaml: apiVersion: kubeadm.k8s.io/v1alpha3, kind: ClusterConfiguration ...
  • Building Kubernetes High Availability Clusters: Easy Guide 101
    Method 1: Create Kubernetes High Availability Cluster With Stacked Control ... to interact with the control plane node if a timeout occurs.
  • Troubleshooting installations | OpenShift Container Platform 4.8
    The bootstrap machine boots and starts hosting the remote resources required for the control plane machines (also known as the master machines) to...
  • KubeKey offline deployment of KubeSphere v3.0.0
    TASK [common : Kubesphere | Deploy minio] ... fullnameOverride=minio --namespace kubesphere-system --wait --timeout 1800s ... k8s version 1.18.6.
