K8s HA installation timed out on task "Join master to ControlPlane"
See original GitHub issue

Describe the bug
K8s HA installation fails randomly on the task "Join master to ControlPlane" on Azure environments.
To Reproduce
Steps to reproduce the behavior:
- execute epicli apply -f test.yml
Expected behavior
The HA cluster is deployed successfully.
Config files
Configuration that should be included in the YAML file:

specification:
  components:
    kubernetes_master:
      count: 3
    kubernetes_node:
      count: 3
---
kind: configuration/shared-config
title: Shared configuration that will be visible to all roles
name: default
specification:
  use_ha_control_plane: true
provider: azure
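For reference, a minimal sketch of how these fragments typically sit together in a single test.yml (any field beyond the ones quoted above is an assumption; real cloud, network, and admin settings must still be filled in):

kind: epiphany-cluster
name: default
provider: azure
specification:
  components:
    kubernetes_master:
      count: 3
    kubernetes_node:
      count: 3
---
kind: configuration/shared-config
name: default
provider: azure
specification:
  use_ha_control_plane: true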
Task where the problem appears:

- when: not kubernetes_common.master_already_joined
  block:
    - include_role:
        name: kubernetes_common
        tasks_from: ensure-token

    - block:
        - name: Ensure /etc/kubeadm/ directory
          file:
            path: /etc/kubeadm/
            state: directory
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Render /etc/kubeadm/kubeadm-join-master.yml template
          template:
            src: kubeadm-join-master.yml.j2
            dest: /etc/kubeadm/kubeadm-join-master.yml
            owner: root
            group: root
            mode: u=rw,go=r

        - name: Join master to ControlPlane
          shell: |
            kubeadm join \
              --config /etc/kubeadm/kubeadm-join-master.yml
          args:
            executable: /bin/bash

        - name: Mark master as joined
          set_fact:
            kubernetes_common: >-
              {{ kubernetes_common | default({}) | combine(set_fact, recursive=true) }}
          vars:
            set_fact:
              master_already_joined: true

- name: Include kubelet configuration tasks
  include_role:
    name: kubernetes_common
    tasks_from: configure-kubelet
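The error in the log below, "etcdserver: leader changed", is transient: while the new etcd member joins, the cluster may re-elect its leader, and the write that kubeadm's control-plane-join/update-status phase makes to etcd fails if it lands in that window. One possible mitigation is to retry the join, as in the illustrative sketch below (retry counts and delays are arbitrary, and this is not necessarily the fix adopted in the PR referenced at the end of this issue). Note that after a partially completed join, a bare kubeadm reset can leave a stale etcd member behind, so a production fix needs more care:

- name: Join master to ControlPlane (illustrative retry wrapper)
  shell: |
    # Retry the join on transient etcd errors; reset local state between attempts.
    kubeadm join --config /etc/kubeadm/kubeadm-join-master.yml \
      || { kubeadm reset --force; exit 1; }
  args:
    executable: /bin/bash
  register: join_result
  retries: 3
  delay: 30
  until: join_result.rc == 0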
OS (please complete the following information):
- OS: [RHEL, Ubuntu]
Cloud Environment (please complete the following information):
- Cloud Provider: [MS Azure]
Additional context
Log:
2020-07-27T13:02:47.6918829Z 13:02:47 INFO cli.engine.ansible.AnsibleCommand - TASK [kubernetes_master : Join master to ControlPlane] *************************
2020-07-27T13:03:56.4662918Z 13:03:56 INFO cli.engine.ansible.AnsibleCommand - fatal: [ci-06hatodevazrhcanal-kubernetes-master-vm-2]: FAILED! => changed: true, cmd: "kubeadm join --config /etc/kubeadm/kubeadm-join-master.yml", rc: 1, msg: "non-zero return code", start: 2020-07-27 13:02:48.341985, end: 2020-07-27 13:03:56.339262, delta: 0:01:07.997277

stderr:
W0727 13:02:48.379669 13265 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
	[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
	[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
W0727 13:03:20.332544 13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
W0727 13:03:20.339844 13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
W0727 13:03:20.340659 13265 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
{"level":"warn","ts":"2020-07-27T13:03:44.441Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://10.1.1.9:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
error execution phase control-plane-join/update-status: error uploading configuration: etcdserver: leader changed
To see the stack trace of this error execute with --v=5 or higher

stdout:
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 localhost] and IPs [10.1.1.9 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [ci-06hatodevazrhcanal-kubernetes-master-vm-2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.1.1.9]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
This happens randomly; on average, one of every two HA deployments on Azure fails because of this issue.
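Because the failure is intermittent, it helps to confirm on the masters whether an etcd leader election coincided with the update-status phase. Below is a diagnostic sketch, assuming the stacked-etcd certificate paths that kubeadm uses by default and that an etcdctl binary is available on the host:

- name: Check etcd cluster and leader status (diagnostic only)
  shell: |
    # Prints one row per etcd member, including which member is the leader.
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint status --cluster -w table
  args:
    executable: /bin/bash
  register: etcd_status
  changed_when: false

- debug:
    var: etcd_status.stdout_lines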
Top GitHub Comments
This has now been tested with the latest develop code and Kubernetes 1.18.6, and I haven't been able to reproduce the problem yet. More testing is still underway. The problem was reproduced many times in version 0.6.
Handled in this PR