Workers timing out, unable to connect to the scheduler

Hi, all.

I’m trying to use KubeCluster, but all the worker pods seem to be unable to reach the scheduler.

kubectl logs <dask-jovian-xxx> returns this:

distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:39242'
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45190
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:45190
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
distributed.worker - INFO -       Local Directory:           /worker-c_e9c_6d
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:35363
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:39242'
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:45190
distributed.worker - INFO - Closed worker has not yet started: None
distributed.nanny - ERROR - Timed out connecting Nanny '<Nanny: None, threads: 2>' to scheduler 'tcp://127.0.0.1:35363'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/nanny.py", line 309, in instantiate
    timedelta(seconds=self.death_timeout), self.process.start()
tornado.util.TimeoutError: Timeout
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/opt/conda/bin/dask-worker", line 10, in <module>
    sys.exit(go())
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 406, in go
    main()
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 397, in main
    raise TimeoutError("Timed out starting worker.") from None
tornado.util.TimeoutError: Timed out starting worker.
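
(The address the worker keeps retrying, tcp://127.0.0.1:35363, is the scheduler address the worker pod was handed; inside a container, 127.0.0.1 is the pod’s own loopback, so a scheduler running in a different pod can never be reached there. A rough sketch for checking which address the cluster is actually advertising, assuming cluster is the KubeCluster instance created in the notebook — the creation code isn’t shown above:)

from dask.distributed import Client

# `cluster` is assumed to be the KubeCluster created in the notebook (not shown above)
print(cluster.scheduler_address)            # the address the notebook process sees
client = Client(cluster)
print(client.scheduler_info()["address"])   # the address the scheduler itself reports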

The Kubernetes cluster runs on EC2.

Helm values for Jupyter (<host> replaced with my EC2 host):

jupyter:
  name: jupyter
  enabled: true
  image:
    repository: "daskdev/dask-notebook"
    tag: 1.1.5
    pullPolicy: IfNotPresent
    pullSecrets:
    #  - name: regcred
  replicas: 1
  # serviceType: "ClusterIP"
  # serviceType: "NodePort"
  serviceType: "LoadBalancer"
  servicePort: 80
  ingress:
    enabled: true
    path: /
    # Used to create an Ingress record.
    hosts:
      - <host>
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      nginx.ingress.kubernetes.io/rewrite-target: /
      nginx.ingress.kubernetes.io/add-base-url: "true"
      # kubernetes.io/tls-acme: "true"
    labels: {}
    tls:
      #Secrets must be manually created in the namespace.
      - secretName: dask-scheduler-secret 
        hosts:
          - <host>

  # This hash corresponds to the password 'dask'
  password: 'sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c'
  env:
   - name: EXTRA_CONDA_PACKAGES
     value: "jupyter-server-proxy dask-kubernetes -y -c conda-forge"
  tolerations: []
  nodeSelector: {}
  affinity: {}

Role added to get/watch/list/create/delete pods:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: service-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["services", "pods"]
  verbs: ["get", "watch", "create", "delete", "list"]

Role binding:

kubectl create clusterrolebinding service-reader-pod \
  --clusterrole=service-reader  \
  --serviceaccount=default:default

worker-spec.yaml:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
    name: dask
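
A spec like this is typically consumed from the notebook roughly as follows (a sketch along the lines of the dask-kubernetes docs; the exact code may differ):

from dask_kubernetes import KubeCluster
from dask.distributed import Client

# Build the cluster from the pod template above and request a few workers
cluster = KubeCluster.from_yaml("worker-spec.yaml")
cluster.scale(3)

# Attach a client so submitted work goes to this cluster's scheduler
client = Client(cluster)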

Jobs seem to be submitted to the scheduler, but no workers are available. The worker pods stay in the Running state, attempting to connect, until --death-timeout raises the TimeoutError above.
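
(The registered workers can also be checked directly from the notebook; continuing the sketch above, an empty mapping here confirms that nothing ever connected:)

# Dict of registered workers, keyed by worker address; empty while the
# pods are still stuck on "Waiting to connect"
print(client.scheduler_info()["workers"])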

Some screenshots: https://imgur.com/a/EuvN9em

Anything obvious I’m missing?

Love Dask, thanks!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

1 reaction
Techn0logic commented, Oct 25, 2019

Correct, I lacked permissions.

I needed to create:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: daskKubernetes
rules:
- apiGroups:
  - ""  # indicates the core API group
  resources:
  - "pods"
  verbs:
  - "get"
  - "list"
  - "watch"
  - "create"
  - "delete"
- apiGroups:
  - ""  # indicates the core API group
  resources:
  - "pods/log"
  verbs:
  - "get"
  - "list"
- apiGroups:
  - "" # indicates the core API group
  resources:
  - "services"
  verbs:
  - "get"
  - "list"
  - "watch"
  - "create"
  - "delete"

and bind it:

kubectl create clusterrolebinding dask-pod-creator \
  --clusterrole=daskKubernetes \
  --serviceaccount=default:default

Now it works like a charm. Thank you so much for your help!

0 reactions
jacobtomlinson commented, Oct 25, 2019

“This is running helm install stable/dask with dask_kubernetes”

That Helm chart isn’t really related to dask-kubernetes, other than using the same technologies to deploy a similar cluster.

Do you have the correct permissions for creating pods, as per the documentation?
