Workers timing out, unable to connect to the scheduler
Hi, all.
I’m trying to use KubeCluster, but all worker pods seem to be unable to reach the scheduler.
`kubectl logs <dask-jovian-xxx>` returns this:
```
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:39242'
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:45190
distributed.worker - INFO - Listening to: tcp://127.0.0.1:45190
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-c_e9c_6d
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:35363
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:39242'
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:45190
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 2>' to scheduler 'tcp://127.0.0.1:35363'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
    timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 10, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
    main()
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
```
The Kubernetes cluster runs on EC2.

Helm values for Jupyter (`<host>` replaced with my EC2 host):
```yaml
jupyter:
  name: jupyter
  enabled: true
  image:
    repository: "daskdev/dask-notebook"
    tag: 1.1.5
    pullPolicy: IfNotPresent
    pullSecrets:
    #  - name: regcred
  replicas: 1
  # serviceType: "ClusterIP"
  # serviceType: "NodePort"
  serviceType: "LoadBalancer"
  servicePort: 80
  ingress:
    enabled: true
    path: /
    # Used to create an Ingress record.
    hosts:
      - <host>
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      nginx.ingress.kubernetes.io/rewrite-target: /
      nginx.ingress.kubernetes.io/add-base-url: "true"
      # kubernetes.io/tls-acme: "true"
    labels: {}
    tls:
      # Secrets must be manually created in the namespace.
      - secretName: dask-scheduler-secret
        hosts:
          - <host>
  # This hash corresponds to the password 'dask'
  password: 'sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c'
  env:
    - name: EXTRA_CONDA_PACKAGES
      value: "jupyter-server-proxy dask-kubernetes -y -c conda-forge"
  tolerations: []
  nodeSelector: {}
  affinity: {}
```
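The chart itself was installed along these lines (a sketch; the release name is a placeholder, not from my actual setup):

```sh
# Helm 2 syntax; "my-dask" is an illustrative release name
helm install --name my-dask -f values.yaml stable/dask
```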
A ClusterRole was added to get/watch/list/create/delete pods:
```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  # Note: ClusterRoles are cluster-scoped, so this namespace field is ignored
  namespace: default
  name: service-reader
rules:
  - apiGroups: [""]  # "" indicates the core API group
    resources: ["services", "pods"]
    verbs: ["get", "watch", "create", "delete", "list"]
```
And the role binding:
```sh
kubectl create clusterrolebinding service-reader-pod \
  --clusterrole=service-reader \
  --serviceaccount=default:default
```
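For reference, the declarative equivalent of that binding (not part of my setup, included just for completeness):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: service-reader-pod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: service-reader
subjects:
  # Binds the role to the default service account in the default namespace
  - kind: ServiceAccount
    name: default
    namespace: default
```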
`worker-spec.yaml`:
```yaml
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
    - image: daskdev/dask:latest
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
      name: dask
```
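I create the cluster from this spec along these lines (simplified; the scale count is illustrative):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Build the cluster from the pod template above
cluster = KubeCluster.from_yaml('worker-spec.yaml')
cluster.scale(3)  # request three worker pods

client = Client(cluster)
```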
Jobs seem to be submitted to the scheduler, but no workers are available. The worker pods stay in the Running state, attempting to connect until `--death-timeout` raises the TimeoutError above.
Some screenshots: https://imgur.com/a/EuvN9em

Anything obvious I’m missing?
Love Dask, thanks!
Issue Analytics
- Created 4 years ago
- Comments: 14 (6 by maintainers)
Top GitHub Comments
Correct, I lacked permissions. I needed to create the ClusterRole above and bind it. Now it works like a charm. Thank you so much for your help!
That Helm chart isn’t really related to `dask-kubernetes`, other than using the same technologies to deploy a similar cluster. Do you have the correct permissions for creating pods, as per the documentation?
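One way to check what a service account is allowed to do (illustrative; this assumes the `default` service account in the `default` namespace, as in the binding above):

```sh
# Can the notebook's service account create pods?
kubectl auth can-i create pods --as=system:serviceaccount:default:default

# Listing pods is needed as well
kubectl auth can-i list pods --as=system:serviceaccount:default:default
```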