Help: Scheduler on cluster doesn't seem to work
This might need to be split into several tickets. I just tried to upgrade to a newer version of dask-kubernetes. If I switch on legacy mode, everything seems to work fine, but in the new mode, where the scheduler runs as a separate pod, I run into several issues. I might be missing something that would resolve all three of these:
A small issue: the scheduler pod takes the same name as a worker (so you can't tell which pod is the scheduler by its name), and worse, it also uses the same resource requests, which it doesn't really need. Also, because the scheduler runs as a separate pod, cleanup becomes a nightmare when the client pod is killed or crashes (rather than terminating cleanly): nothing gets cleaned up, and instead of the old behaviour (workers exiting after 60 seconds) the workers and the scheduler just stick around forever.
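One way around the naming/resources complaint is to give the scheduler its own pod template with a distinct name prefix and lighter resource requests. Whether your dask-kubernetes version accepts a separate scheduler template depends on the release, so the sketch below only builds the plain Kubernetes manifest dicts (the dict structure itself is standard Kubernetes); the image name and resource figures are illustrative assumptions, not values from the issue.

```python
# Hedged sketch: build separate worker and scheduler pod manifests so the
# scheduler is identifiable by name and doesn't inherit the workers' large
# resource requests. The values below are illustrative assumptions.

def make_pod_manifest(name_prefix, cpu_request, memory_request):
    """Build a minimal pod manifest with explicit resource requests."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"generateName": f"{name_prefix}-"},
        "spec": {
            "containers": [
                {
                    "name": name_prefix,
                    "image": "daskdev/dask:latest",  # assumed image
                    "resources": {
                        "requests": {"cpu": cpu_request, "memory": memory_request},
                        "limits": {"cpu": cpu_request, "memory": memory_request},
                    },
                }
            ],
            "restartPolicy": "Never",
        },
    }

# Workers get the heavy requests; the scheduler gets a lighter, clearly named pod.
worker_template = make_pod_manifest("dask-worker", "4", "16G")
scheduler_template = make_pod_manifest("dask-scheduler", "1", "2G")
```

If your dask-kubernetes release supports passing a dedicated scheduler template to `KubeCluster`, these two dicts could be handed to it; otherwise the worker template alone still makes the resource requests explicit.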
The bigger issue: I can't get it working at all. There are pickle errors when both the worker and the client try to connect to the scheduler (`distributed.protocol.pickle - INFO - Failed to deserialize`), although they seem to be masked by a timeout error.
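Deserialization failures like this between client, scheduler, and workers are often caused by mismatched package versions (e.g. a different pickle/cloudpickle or distributed version in the pods than on the client). distributed's `Client.get_versions(check=True)` performs such a check against a live cluster; below is a simplified, stand-alone sketch of the same comparison, with made-up version numbers purely for illustration.

```python
# Hedged sketch: compare package versions reported by each role and flag
# disagreements, similar in spirit to Client.get_versions(check=True).

def find_version_mismatches(client, scheduler, workers):
    """Return {package: {role: version}} for packages that disagree.

    `client` and `scheduler` map package name -> version string;
    `workers` maps worker address -> such a dict.
    """
    roles = {"client": client, "scheduler": scheduler, **workers}
    packages = set().union(*(v.keys() for v in roles.values()))
    mismatches = {}
    for pkg in sorted(packages):
        seen = {role: versions.get(pkg) for role, versions in roles.items()}
        if len(set(seen.values())) > 1:
            mismatches[pkg] = seen
    return mismatches

# Illustrative (made-up) versions: the client has an older cloudpickle
# than the scheduler and worker pods.
client_v = {"distributed": "2.9.0", "cloudpickle": "1.2.2"}
scheduler_v = {"distributed": "2.9.0", "cloudpickle": "1.3.0"}
worker_v = {"tcp://10.0.0.5:34567": {"distributed": "2.9.0", "cloudpickle": "1.3.0"}}

print(find_version_mismatches(client_v, scheduler_v, worker_v))
# -> only "cloudpickle" is reported, since "distributed" agrees everywhere
```

If the check turns up a disagreement, pinning the same image/package versions on the client and in the pod spec is the usual fix.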
Is legacy mode going to disappear in the long run (the name suggests it), or is it safe to keep using it?
Issue Analytics
- Created: 2 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
This is the default from the docs (besides the large cpu/mem requests):
Could you share your `spec.yml` too? I'd like to try and reproduce this locally.