Fails to discover master and form a cluster
Chart version: 7.3.0
Kubernetes version: 1.14.6
Kubernetes provider: AWS (kops)
Helm Version: v2.15.0
Output of helm get release (e.g. helm get elasticsearch, replacing elasticsearch with the name of your Helm release):
REVISION: 1
RELEASED: Tue Oct 22 11:12:21 2019
CHART: elasticsearch-7.3.0
USER-SUPPLIED VALUES:
image: docker.elastic.co/elasticsearch/elasticsearch-oss
imageTag: 7.3.2
roles:
  ingest: false
COMPUTED VALUES:
antiAffinity: hard
antiAffinityTopologyKey: kubernetes.io/hostname
clusterHealthCheckParams: wait_for_status=green&timeout=1s
clusterName: elasticsearch
esConfig: {}
esJavaOpts: -Xmx1g -Xms1g
esMajorVersion: ""
extraEnvs: []
extraInitContainers: []
extraVolumeMounts: []
extraVolumes: []
fsGroup: ""
fullnameOverride: ""
httpPort: 9200
image: docker.elastic.co/elasticsearch/elasticsearch-oss
imagePullPolicy: IfNotPresent
imagePullSecrets: []
imageTag: 7.3.2
ingress:
  annotations: {}
  enabled: false
  hosts:
  - chart-example.local
  path: /
  tls: []
initResources: {}
labels: {}
lifecycle: {}
masterService: ""
masterTerminationFix: false
maxUnavailable: 1
minimumMasterNodes: 2
nameOverride: ""
networkHost: 0.0.0.0
nodeAffinity: {}
nodeGroup: master
nodeSelector: {}
persistence:
  annotations: {}
  enabled: true
podAnnotations: {}
podManagementPolicy: Parallel
podSecurityContext:
  fsGroup: 1000
priorityClassName: ""
protocol: http
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5
replicas: 3
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 100m
    memory: 2Gi
roles:
  data: "true"
  ingest: false
  master: "true"
schedulerName: ""
secretMounts: []
securityContext:
  capabilities:
    drop:
    - ALL
  runAsNonRoot: true
  runAsUser: 1000
service:
  annotations: {}
  nodePort: null
  type: ClusterIP
sidecarResources: {}
sysctlInitContainer:
  enabled: true
sysctlVmMaxMapCount: 262144
terminationGracePeriod: 120
tolerations: []
transportPort: 9300
updateStrategy: RollingUpdate
volumeClaimTemplate:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
HOOKS:
---
# elasticsearch-qxaid-test
apiVersion: v1
kind: Pod
metadata:
  name: "elasticsearch-qxaid-test"
  annotations:
    "helm.sh/hook": test-success
spec:
  containers:
  - name: "elasticsearch-dxggn-test"
    image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
    command:
    - "sh"
    - "-c"
    - |
      #!/usr/bin/env bash -e
      curl -XGET --fail 'elasticsearch-master:9200/_cluster/health?wait_for_status=green&timeout=1s'
  restartPolicy: Never
MANIFEST:
---
# Source: elasticsearch/templates/poddisruptionbudget.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: "elasticsearch-master-pdb"
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: "elasticsearch-master"
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    {}
spec:
  type: ClusterIP
  selector:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  ports:
  - name: http
    protocol: TCP
    port: 9200
  - name: transport
    protocol: TCP
    port: 9300
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master-headless
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None # This is needed for statefulset hostnames like elasticsearch-0 to resolve
  # Create endpoints also if the related pod isn't ready
  publishNotReadyAddresses: true
  selector:
    app: "elasticsearch-master"
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
# Source: elasticsearch/templates/statefulset.yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Tiller"
    release: "elasticsearch"
    chart: "elasticsearch-7.3.0"
    app: "elasticsearch-master"
  annotations:
    esMajorVersion: "7"
spec:
  serviceName: elasticsearch-master-headless
  selector:
    matchLabels:
      app: "elasticsearch-master"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
  template:
    metadata:
      name: "elasticsearch-master"
      labels:
        heritage: "Tiller"
        release: "elasticsearch"
        chart: "elasticsearch-7.3.0"
        app: "elasticsearch-master"
      annotations:
    spec:
      securityContext:
        fsGroup: 1000
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - "elasticsearch-master"
            topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 120
      volumes:
      initContainers:
      - name: configure-sysctl
        securityContext:
          runAsUser: 0
          privileged: true
        image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        resources:
          {}
      containers:
      - name: "elasticsearch"
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        image: "docker.elastic.co/elasticsearch/elasticsearch-oss:7.3.2"
        imagePullPolicy: "IfNotPresent"
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file
              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }
              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy'
                  http "/"
              else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        resources:
          limits:
            cpu: 1000m
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 2Gi
        env:
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: ""
        - name: discovery.seed_hosts
          value: "elasticsearch-master-headless"
        - name: cluster.name
          value: "elasticsearch"
        - name: network.host
          value: "0.0.0.0"
        - name: ES_JAVA_OPTS
          value: "-Xmx1g -Xms1g"
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "false"
        - name: node.master
          value: "true"
        volumeMounts:
        - name: "elasticsearch-master"
          mountPath: /usr/share/elasticsearch/data
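Note the rendered env block above: cluster.initial_master_nodes is an empty string. For comparison, with replicas: 3 the chart derives the bootstrap list from the master pod names, so a healthy render would look roughly like this (a sketch of the expected output, not taken from this deployment):

- name: cluster.initial_master_nodes
  value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2"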
Describe the bug: We have been deploying the Elasticsearch 7.3.0 Helm chart to freshly built kops k8s clusters without issues for many weeks. However, when I tried to rebuild one environment yesterday, it now always fails at the Elasticsearch deployment step with the error below, and I cannot figure out why.
{"type": "server", "timestamp": "2019-10-22T18:10:22,660+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{elasticsearch-master-0}{jaY1E2suRxqrLK_HZxzI6w}{ZKO_C0-QSeSD2r5ZpY2wfA}{100.96.5.4}{100.96.5.4:9300}{dm}, {elasticsearch-master-2}{-xgmZe-2Q-GaZ_-x80LMhQ}{cd3IWwklR8GRAapYdoiISQ}{100.96.4.3}{100.96.4.3:9300}{dm}, {elasticsearch-master-1}{h2qFhHx8SHuwbzWyWM7Xvw}{awYThFu8RNGOilyZfqF9Xg}{100.96.3.6}{100.96.3.6:9300}{dm}]; discovery will continue using [100.96.4.3:9300, 100.96.3.6:9300] from hosts providers and [{elasticsearch-master-0}{jaY1E2suRxqrLK_HZxzI6w}{ZKO_C0-QSeSD2r5ZpY2wfA}{100.96.5.4}{100.96.5.4:9300}{dm}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
{"type": "server", "timestamp": "2019-10-22T18:10:27,653+0000", "level": "DEBUG", "component": "o.e.a.a.c.h.TransportClusterHealthAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "no known master node, scheduling a retry" }
{"type": "server", "timestamp": "2019-10-22T18:10:28,654+0000", "level": "DEBUG", "component": "o.e.a.a.c.h.TransportClusterHealthAction", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "timed out while retrying [cluster:monitor/health] after failure (timeout [1s])" }
{"type": "server", "timestamp": "2019-10-22T18:10:28,654+0000", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "path: /_cluster/health, params: {wait_for_status=green, timeout=1s}" ,
"stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null",
"at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }
Steps to reproduce:
- Create a new k8s cluster using the latest kops 1.14.0 release
- Deploy the Elasticsearch Helm chart with:
helm install elastic/elasticsearch --name elasticsearch --set image=docker.elastic.co/elasticsearch/elasticsearch-oss --set imageTag=7.3.2 --set roles.ingest=false --version 7.3.0 --atomic
- Check each master pod's log: the node fails to discover a master and form a cluster (a local render check that reproduces the issue without a cluster is sketched after these steps).
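The same values can also be rendered offline to see what cluster.initial_master_nodes will contain before anything is installed (a sketch; assumes the elastic repo was added with helm repo add elastic https://helm.elastic.co):

$ helm fetch elastic/elasticsearch --version 7.3.0 --untar
$ helm template elasticsearch --name elasticsearch --set image=docker.elastic.co/elasticsearch/elasticsearch-oss --set imageTag=7.3.2 --set roles.ingest=false | grep -A 1 initial_master_nodes

If the Helm 2.15.0 regression discussed below is the culprit, the rendered value should come out empty, matching the manifest above.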
Expected behavior: Successful deployment of the Elasticsearch Helm chart, as before.
Provide logs and/or server output (if relevant):
$ kubectl describe service/elasticsearch-master
---
Name:              elasticsearch-master
Namespace:         default
Labels:            app=elasticsearch-master
                   chart=elasticsearch-7.3.0
                   heritage=Tiller
                   release=elasticsearch
Annotations:       <none>
Selector:          app=elasticsearch-master,chart=elasticsearch-7.3.0,heritage=Tiller,release=elasticsearch
Type:              ClusterIP
IP:                100.67.88.213
Port:              http  9200/TCP
TargetPort:        9200/TCP
Endpoints:
Port:              transport  9300/TCP
TargetPort:        9300/TCP
Endpoints:
Session Affinity:  None
Events:            <none>
$ kubectl describe service/elasticsearch-master-headless
---
Name:              elasticsearch-master-headless
Namespace:         default
Labels:            app=elasticsearch-master
                   chart=elasticsearch-7.3.0
                   heritage=Tiller
                   release=elasticsearch
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
Selector:          app=elasticsearch-master
Type:              ClusterIP
IP:                None
Port:              http  9200/TCP
TargetPort:        9200/TCP
Endpoints:         100.96.3.7:9200,100.96.4.5:9200,100.96.5.5:9200
Port:              transport  9300/TCP
TargetPort:        9300/TCP
Endpoints:         100.96.3.7:9300,100.96.4.5:9300,100.96.5.5:9300
Session Affinity:  None
Events:            <none>
$ kubectl get all
---
NAME                         READY   STATUS    RESTARTS   AGE
pod/busybox                  1/1     Running   0          14m
pod/elasticsearch-master-0   0/1     Running   0          9m49s
pod/elasticsearch-master-1   0/1     Running   0          9m49s
pod/elasticsearch-master-2   0/1     Running   0          9m49s

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/elasticsearch-master            ClusterIP   100.67.88.213   <none>        9200/TCP,9300/TCP   9m49s
service/elasticsearch-master-headless   ClusterIP   None            <none>        9200/TCP,9300/TCP   9m49s
service/kubernetes                      ClusterIP   100.64.0.1      <none>        443/TCP             47m

NAME                                    READY   AGE
statefulset.apps/elasticsearch-master   0/3     9m49s
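Note that plain connectivity is not the problem here: the headless service lists all three pods as 9300 endpoints, which matches the "have discovered" list in the WARN log. DNS for the seed host can be double-checked from the busybox pod above (a diagnostic sketch):

$ kubectl exec busybox -- nslookup elasticsearch-master-headless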
Top GitHub Comments
@jmlrt Understood. Thank you for sharing your thoughts. I think another good practice may be to simply stick with the Helm version currently used in your tests/ folder. It never occurred to me that going from one Helm release to another could result in a cluster formation error like this. I guess I've learned it the hard way! Properly upgrading apps is a small science already. 😄

Well, the Elastic Helm charts should in theory be compatible with every Helm v2 release (disclosure: we certainly won't be compatible with Helm v3 with the current code), as we don't have code specific to any particular release.
However, Helm 2.15.0 brought a lot of changes, including one breaking change to the chart apiVersion and two regressions, which forced them to release 2.15.1 in an emergency a few hours ago (one change was fixed and the other was reverted). If you take a look at these issues, you'll see that it wasn't only the Elastic charts that were impacted, but many other charts as well.
Overall, I advise always testing Helm version upgrades in a sandbox environment first (this holds for almost all software, but especially for Helm).
In addition, I can add a mention of the Helm version we currently test with to the README.
Oh, and by the way, #338 has been merged in the meantime, so you should now be able to use Helm 2.15.1 with the Elastic charts, provided your other charts have no issues.
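Following that advice, one way to guard a deploy pipeline is to pin the client version and fail fast before helm is invoked (a sketch; the pinned version number and the parsing of helm version --client --short are illustrative, not taken from this thread):

# Fail fast if the pipeline's helm client drifts from the version the charts were tested with.
EXPECTED="v2.14.3"   # illustrative pin
ACTUAL=$(helm version --client --short | awk '{print $2}' | cut -d+ -f1)
if [ "${ACTUAL}" != "${EXPECTED}" ]; then
  echo "helm client is ${ACTUAL}, expected ${EXPECTED}" >&2
  exit 1
fi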