
Help: Scheduler on cluster doesn't seem to work

See original GitHub issue

This might need to be split into several tickets. I just tried to upgrade to a newer version of dask-kubernetes. If I switch on legacy mode, this seems to work fine. But if I switch to the new mode, where the scheduler runs as a separate pod, I run into several issues, though I might be missing something that would resolve all three of these:

A small issue: the scheduler takes the same name as a worker (so you can't tell which pod is the scheduler by looking at the name), but worse, it also uses the same resource requests (which it doesn't really need). Also, because the scheduler runs as a separate pod, cleanup will be a nightmare when the client pod is killed or crashes (rather than terminating cleanly): nothing gets cleaned up, and instead of the old behaviour (workers exiting after 60 seconds), the workers and the scheduler just stick around forever.

The bigger issue: I can't get it working at all. Both the worker and the client hit pickle errors when trying to connect to the scheduler (distributed.protocol.pickle - INFO - Failed to deserialize), although this seems to be masked by a timeout error.
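A "Failed to deserialize" log usually means the bytes received don't unpickle on the receiving side, commonly a client/scheduler version mismatch (note that the pod spec later in this thread installs distributed from git main via EXTRA_PIP_PACKAGES, so the pods may run a different version than the client). As a minimal stdlib illustration of the failure mode (an assumption about the mechanism, not distributed's actual code path):

```python
import pickle

# A well-formed message round-trips fine.
payload = pickle.dumps({"op": "register-worker", "address": "tcp://10.0.0.5:41233"})

# Simulate a receiver that cannot interpret the stream (e.g. a truncated
# or incompatible payload): deserialization raises instead of returning.
try:
    pickle.loads(payload[:-4])
except Exception as exc:
    print(f"Failed to deserialize: {type(exc).__name__}")
```

Checking that `client.get_versions(check=True)` reports matching versions on client, scheduler, and workers is the usual first step for this class of error.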

Is legacy mode going to disappear in the long run (the name suggests it), or is it safe to keep using it?

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
pvanderlinden commented, Jun 15, 2021

This is the default from the docs (besides the large cpu/mem requests):

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 6GB, --death-timeout, '60']
    name: dask
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      requests:
        cpu: "50m"
        memory: 500m  # note: "m" means milli in Kubernetes quantities; 500M or 500Mi was likely intended
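
One thing worth double-checking in the spec above: in Kubernetes resource quantities the `m` suffix means milli, so `memory: 500m` requests half a byte, not 500 megabytes (`500M`) or mebibytes (`500Mi`). A hypothetical helper sketching how the suffixes are interpreted (the suffix table follows the Kubernetes quantity notation; the function itself is my own illustration, not a Kubernetes API):

```python
# Kubernetes quantity suffixes: decimal (k, M, G, T), binary (Ki, Mi, Gi, Ti),
# and milli (m), which is valid for CPU but almost never what you want for memory.
SUFFIXES = {
    "m": 1e-3,
    "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}

def parse_quantity(q: str) -> float:
    """Parse a Kubernetes resource quantity string into base units."""
    # Try the longer (binary) suffixes first so "Mi" is not read as "i".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * SUFFIXES[suffix]
    return float(q)

print(parse_quantity("500m"))   # 0.5  -> half a byte of memory
print(parse_quantity("500Mi"))  # 524288000.0
```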

0 reactions
jacobtomlinson commented, Jun 15, 2021

Could you share your spec.yml too? I'd like to try and reproduce this locally.

Read more comments on GitHub >

Top Results From Across the Web

  • The scheduler does not appear to be running | AWS re:Post
    Hello I am trying to add my first dag but I am getting the following error: The scheduler does not appear to be...
  • How to Debug Kubernetes Pending Pods and Scheduling ...
    Learn how to debug Pending pods that fail to get scheduled due to resource constraints, taints, affinity rules, and other reasons.
  • Scheduler not working or firing - MuleSoft Help Center
    In cluster Scheduler runs only on one server - primary one. The one which started first in cluster and has green star on...
  • Complex repeating scheduled task runs only on one target ...
    2> When you schedule a task to run on all servers in the cluster, ... There doesnt seems to be a problem with...
  • Solved: Why do scheduled searches randomly stop running in...
    The frequency of this issue can vary, and appears to be related to overall scheduler activity. Our production cluster saw it happen every...
