Fails to discover master and form a cluster
Chart version: 7.3.0
Kubernetes version: 1.14.6
Kubernetes provider: AWS (kops)
Helm Version: v2.15.0
Output of helm get release (e.g. helm get elasticsearch, replacing elasticsearch with the name of your Helm release):
REVISION: 1
RELEASED: Tue Oct 22 11:12:21 2019
CHART: elasticsearch-7.3.0
USER-SUPPLIED VALUES:
image: docker.elastic.co/elasticsearch/elasticsearch-oss
imageTag: 7.3.2
roles:
  ingest: false
COMPUTED VALUES:
antiAffinity: hard
antiAffinityTopologyKey: kubernetes.io/hostname
clusterHealthCheckParams: wait_for_status=green&timeout=1s
clusterName: elasticsearch
esConfig: {}
esJavaOpts: -Xmx1g -Xms1g
esMajorVersion: ""
extraEnvs: []
extraInitContainers: []
extraVolumeMounts: []
extraVolumes: []
fsGroup: ""
fullnameOverride: ""
httpPort: 9200
image: docker.elastic.co/elasticsearch/elasticsearch-oss
imagePullPolicy: IfNotPresent
imagePullSecrets: []
imageTag: 7.3.2
ingress:
  annotations: {}
  enabled: false
  hosts:
  - chart-example.local
  path: /
  tls: []
initResources: {}
labels: {}
lifecycle: {}
masterService: ""
masterTerminationFix: false
maxUnavailable: 1
minimumMasterNodes: 2
nameOverride: ""
networkHost: 0.0.0.0
nodeAffinity: {}
nodeGroup: master
nodeSelector: {}
persistence:
  annotations: {}
  enabled: true
podAnnotations: {}
podManagementPolicy: Parallel
podSecurityContext:
  fsGroup: 1000
priorityClassName: ""
protocol: http
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5
replicas: 3
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 100m
    memory: 2Gi
roles:
  data: "true"
  ingest: false
  master: "true"
schedulerName: ""
secretMounts: []
securityContext:
  capabilities:
    drop:
    - ALL
  runAsNonRoot: true
  runAsUser: 1000
service:
  annotations: {}
  nodePort: null
  type: ClusterIP
sidecarResources: {}
sysctlInitContainer:
  enabled: true
sysctlVmMaxMapCount: 262144
terminationGracePeriod: 120
tolerations: []
transportPort: 9300
updateStrategy: RollingUpdate
volumeClaimTemplate:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
HOOKS:
---
# elasticsearch-qxaid-test
apiVersion: v1
kind: Pod
metadata:
  name: "elasticsearch-qxaid-test"
  annotations:
    "helm.sh/hook": test-success
spec:
  containers:
  - name: "elasticsearch-dxggn-test"
    image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
    command:
    - "sh"
    - "-c"
    - |
      #!/usr/bin/env bash -e
      curl -XGET --fail 'elasticsearch-master:9200/_cluster/health?wait_for_status=green&timeout=1s'
  restartPolicy: Never
MANIFEST:
---
# Source: elasticsearch/templates/poddisruptionbudget.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: "elasticsearch-master-pdb"
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: "elasticsearch-master"
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    {}
spec:
  type: ClusterIP
  selector:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  ports:
  - name: http
    protocol: TCP
    port: 9200
  - name: transport
    protocol: TCP
    port: 9300
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master-headless
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None # This is needed for statefulset hostnames like elasticsearch-0 to resolve
  # Create endpoints also if the related pod isn't ready
  publishNotReadyAddresses: true
  selector:
    app: "elasticsearch-master"
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
# Source: elasticsearch/templates/statefulset.yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    esMajorVersion: "7"
spec:
  serviceName: elasticsearch-master-headless
  selector:
    matchLabels:
      app: "elasticsearch-master"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
  template:
    metadata:
      name: "elasticsearch-master"
      labels:
        heritage: "Tiller"
        release: "elasticsearch"
        chart: "elasticsearch-7.3.0"
        app: "elasticsearch-master"
      annotations:
    spec:
      securityContext:
        fsGroup: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - "elasticsearch-master"
            topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 120
      volumes:
      initContainers:
      - name: configure-sysctl
        securityContext:
          runAsUser: 0
          privileged: true
        image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        resources:
          {}
      containers:
      - name: "elasticsearch"
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
        imagePullPolicy: "IfNotPresent"
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file
              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }
              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy'
                  http "/"
              else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        resources:
          limits:
            cpu: 1000m
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: ""
        - name: discovery.seed_hosts
          value: "elasticsearch-master-headless"
        - name: cluster.name
          value: "elasticsearch"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xmx1g -Xms1g"
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "false"
        - name: node.master
          value: "true"
        volumeMounts:
        - name: "elasticsearch-master"
          mountPath: /usr/share/elasticsearch/data
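Note the rendered env block above: cluster.initial_master_nodes is an empty string. For comparison, with replicas: 3 the chart derives the bootstrap list from the master pod names, so a healthy render would look roughly like this (a sketch of the expected output, not taken from this deployment):

- name: cluster.initial_master_nodes
  value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2"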
Describe the bug: We have been deploying the Elasticsearch 7.3.0 Helm chart to freshly built kops k8s clusters without issues for many weeks. However, when I tried to rebuild one environment yesterday, it now always fails at the Elasticsearch deployment step with the error below, and I cannot figure out why.
{"type": "server", "timestamp": "2019-10-22T18:10:22,660+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{elasticsearch-master-0}{jaY1E2suRxqrLK_HZxzI6w}{ZKO_C0-QSeSD2r5ZpY2wfA}{100.96.5.4}{100.96.5.4:9300}{dm}, {elasticsearch-master-2}{-xgmZe-2Q-GaZ_-x80LMhQ}{cd3IWwklR8GRAapYdoiISQ}{100.96.4.3}{100.96.4.3:9300}{dm}, {elasticsearch-master-1}{h2qFhHx8SHuwbzWyWM7Xvw}{awYThFu8RNGOilyZfqF9Xg}{100.96.3.6}{100.96.3.6:9300}{dm}]; discovery will continue using [100.96.4.3:9300, 100.96.3.6:9300] from hosts providers and [{elasticsearch-master-0}{jaY1E2suRxqrLK_HZxzI6w}{ZKO_C0-QSeSD2r5ZpY2wfA}{100.96.5.4}{100.96.5.4:9300}{dm}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
{"type": "server", "timestamp": "2019-10-22T18:10:27,653+0000", "level": "DEBUG", "component": "o.e.a.a.c.h.TransportClusterHealthAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "no known master node, scheduling a retry" }
{"type": "server", "timestamp": "2019-10-22T18:10:28,654+0000", "level": "DEBUG", "component": "o.e.a.a.c.h.TransportClusterHealthAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "timed out while retrying [cluster:monitor/health] after failure (timeout [1s])" }
{"type": "server", "timestamp": "2019-10-22T18:10:28,654+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "path: /_cluster/health, params: {wait_for_status=green, timeout=1s}" ,
"stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null",
"at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }
Steps to reproduce:
- Create a new k8s cluster using the latest kops 1.14.0 release
- Deploy the Elasticsearch Helm chart with:
helm install elastic/elasticsearch --name elasticsearch --set image=docker.elastic.co/elasticsearch/elasticsearch-oss --set imageTag=7.3.2 --set roles.ingest=false --version 7.3.0 --atomic
- Check each master pod's log: the node fails to discover a master and form a cluster (a local render check that reproduces the issue without a cluster is sketched after these steps).
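The same values can also be rendered offline to see what cluster.initial_master_nodes will contain before anything is installed (a sketch; assumes the elastic repo was added with helm repo add elastic https://helm.elastic.co):

$ helm fetch elastic/elasticsearch --version 7.3.0 --untar
$ helm template elasticsearch --name elasticsearch --set image=docker.elastic.co/elasticsearch/elasticsearch-oss --set imageTag=7.3.2 --set roles.ingest=false | grep -A 1 initial_master_nodes

If the Helm 2.15.0 regression discussed below is the culprit, the rendered value should come out empty, matching the manifest above.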
Expected behavior: Successful deployment of the Elasticsearch Helm chart, as before.
Provide logs and/or server output (if relevant):
$ kubectl describe service/elasticsearch-master
---
Name:              elasticsearch-master
Namespace:         default
Labels:            app=elasticsearch-master
                   chart=elasticsearch-7.3.0
                   heritage=Tiller
                   release=elasticsearch
Annotations:       <none>
Selector:          app=elasticsearch-master,chart=elasticsearch-7.3.0,heritage=Tiller,release=elasticsearch
Type:              ClusterIP
IP:                100.67.88.213
Port:              http  9200/TCP
TargetPort:        9200/TCP
Endpoints:
Port:              transport  9300/TCP
TargetPort:        9300/TCP
Endpoints:
Session Affinity:  None
Events:            <none>
$ kubectl describe service/elasticsearch-master-headless
---
Name:              elasticsearch-master-headless
Namespace:         default
Labels:            app=elasticsearch-master
                   chart=elasticsearch-7.3.0
                   heritage=Tiller
                   release=elasticsearch
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
Selector:          app=elasticsearch-master
Type:              ClusterIP
IP:                None
Port:              http  9200/TCP
TargetPort:        9200/TCP
Endpoints:         100.96.3.7:9200,100.96.4.5:9200,100.96.5.5:9200
Port:              transport  9300/TCP
TargetPort:        9300/TCP
Endpoints:         100.96.3.7:9300,100.96.4.5:9300,100.96.5.5:9300
Session Affinity:  None
Events:            <none>
$ kubectl get all
---
NAME                         READY   STATUS    RESTARTS   AGE
pod/busybox                  1/1     Running   0          14m
pod/elasticsearch-master-0   0/1     Running   0          9m49s
pod/elasticsearch-master-1   0/1     Running   0          9m49s
pod/elasticsearch-master-2   0/1     Running   0          9m49s

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/elasticsearch-master            ClusterIP   100.67.88.213   <none>        9200/TCP,9300/TCP   9m49s
service/elasticsearch-master-headless   ClusterIP   None            <none>        9200/TCP,9300/TCP   9m49s
service/kubernetes                      ClusterIP   100.64.0.1      <none>        443/TCP             47m

NAME                                    READY   AGE
statefulset.apps/elasticsearch-master   0/3     9m49s
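Note that plain connectivity is not the problem here: the headless service lists all three pods as 9300 endpoints, which matches the "have discovered" list in the WARN log. DNS for the seed host can be double-checked from the busybox pod above (a diagnostic sketch):

$ kubectl exec busybox -- nslookup elasticsearch-master-headless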
Top GitHub Comments
@jmlrt Understood. Thank you for sharing your thoughts. I think another good practice may be to simply stick with the Helm version currently used in your tests/ folder. It never occurred to me that going from one Helm release to another could result in a cluster formation error like this. I guess I've learned it the hard way! Properly upgrading apps is a small science already. 😄

Well, the Elastic Helm charts should in theory be compatible with every Helm v2 release (disclosure: we certainly won't be compatible with Helm v3 with the current code), as we don't have code specific to any particular release.
However, Helm 2.15.0 brought a lot of changes, including one breaking change to the chart apiVersion and two regressions, which forced them to release 2.15.1 in an emergency a few hours ago (one change was fixed and the other was reverted). If you take a look at these issues, you'll see that it wasn't only the Elastic charts that were impacted, but many other charts as well.
Overall, I advise always testing Helm version upgrades in a sandbox environment first (this holds for almost all software, but especially for Helm).
In addition, I can add a mention of the Helm version we currently test with to the README.
Oh, and by the way, #338 has been merged in the meantime, so you should now be able to use Helm 2.15.1 with the Elastic charts, provided your other charts have no issues.
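Following that advice, one way to guard a deploy pipeline is to pin the client version and fail fast before helm is invoked (a sketch; the pinned version number and the parsing of helm version --client --short are illustrative, not taken from this thread):

# Fail fast if the pipeline's helm client drifts from the version the charts were tested with.
EXPECTED="v2.14.3"   # illustrative pin
ACTUAL=$(helm version --client --short | awk '{print $2}' | cut -d+ -f1)
if [ "${ACTUAL}" != "${EXPECTED}" ]; then
  echo "helm client is ${ACTUAL}, expected ${EXPECTED}" >&2
  exit 1
fi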