Workers timing out, unable to connect to the scheduler
Hi, all.
I’m trying to use KubeCluster, but all worker pods seem to be unable to reach the scheduler.
`kubectl logs <dask-jovian-xxx>` returns this:
```
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:39242'
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:45190
distributed.worker - INFO - Listening to: tcp://127.0.0.1:45190
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-c_e9c_6d
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:39242'
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:45190
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 2>' to scheduler 'tcp://127.0.0.1:35363'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
    timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 10, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
    main()
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
```
The Kubernetes cluster runs on EC2.

Helm values for Jupyter (`<host>` replaced with my EC2 host):
```yaml
jupyter:
  name: jupyter
  enabled: true
  image:
    repository: "daskdev/dask-notebook"
    tag: 1.1.5
    pullPolicy: IfNotPresent
    pullSecrets:
    #  - name: regcred
  replicas: 1
  # serviceType: "ClusterIP"
  # serviceType: "NodePort"
  serviceType: "LoadBalancer"
  servicePort: 80
  ingress:
    enabled: true
    path: /
    # Used to create an Ingress record.
    hosts:
      - <host>
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      nginx.ingress.kubernetes.io/rewrite-target: /
      nginx.ingress.kubernetes.io/add-base-url: "true"
      # kubernetes.io/tls-acme: "true"
    labels: {}
    tls:
      # Secrets must be manually created in the namespace.
      - secretName: dask-scheduler-secret
        hosts:
          - <host>
  # This hash corresponds to the password 'dask'
  password: 'sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c'
  env:
    - name: EXTRA_CONDA_PACKAGES
      value: "jupyter-server-proxy dask-kubernetes -y -c conda-forge"
  tolerations: []
  nodeSelector: {}
  affinity: {}
```
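The chart itself was installed along these lines (a sketch; the release name is a placeholder, not from my actual setup):

```sh
# Helm 2 syntax; "my-dask" is an illustrative release name
helm install --name my-dask -f values.yaml stable/dask
```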
A ClusterRole was added to get/watch/list/create/delete pods:
```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # Note: ClusterRoles are cluster-scoped, so this namespace field is ignored
  namespace: default
  name: service-reader
rules:
  - apiGroups: [""]  # "" indicates the core API group
    resources: ["services", "pods"]
    verbs: ["get", "watch", "create", "delete", "list"]
```
And the role binding:
```sh
kubectl create clusterrolebinding service-reader-pod \
  --clusterrole=service-reader \
  --serviceaccount=default:default
```
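For reference, the declarative equivalent of that binding (not part of my setup, included just for completeness):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: service-reader-pod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: service-reader
subjects:
  # Binds the role to the default service account in the default namespace
  - kind: ServiceAccount
    name: default
    namespace: default
```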
`worker-spec.yaml`:
```yaml
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
    - image: daskdev/dask:latest
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
      name: dask
```
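I create the cluster from this spec along these lines (simplified; the scale count is illustrative):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Build the cluster from the pod template above
cluster = KubeCluster.from_yaml('worker-spec.yaml')
cluster.scale(3)  # request three worker pods

client = Client(cluster)
```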
Jobs seem to be submitted to the scheduler, but no workers are available. The worker pods stay in the Running state, attempting to connect until `--death-timeout` raises the TimeoutError above.
Some screenshots: https://imgur.com/a/EuvN9em

Anything obvious I’m missing?
Love Dask, thanks!
Issue Analytics
- Created 4 years ago
- Comments: 14 (6 by maintainers)
Top GitHub Comments
Correct, I lacked permissions. I needed to create the ClusterRole above and bind it. Now it works like a charm. Thank you so much for your help!
That Helm chart isn’t really related to `dask-kubernetes`, other than using the same technologies to deploy a similar cluster. Do you have the correct permissions for creating pods, as per the documentation?
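One way to check what a service account is allowed to do (illustrative; this assumes the `default` service account in the `default` namespace, as in the binding above):

```sh
# Can the notebook's service account create pods?
kubectl auth can-i create pods --as=system:serviceaccount:default:default

# Listing pods is needed as well
kubectl auth can-i list pods --as=system:serviceaccount:default:default
```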