Help: Scheduler on cluster doesn't seem to work
This might need to be split into several tickets. I just tried to upgrade to a newer version of dask-kubernetes. If I switch on legacy mode, everything seems to work fine, but in the new mode, where the scheduler runs as a separate pod, I run into several issues. I might be missing something that would resolve all three of these:
A small issue: the scheduler pod takes the same name as a worker (so you can't tell which pod is the scheduler by its name), and worse, it also uses the same resource requests, which it doesn't really need. Also, because the scheduler runs as a separate pod, cleanup becomes a nightmare when the client pod is killed or crashes (rather than terminating cleanly): nothing gets cleaned up, and instead of the old behaviour (workers exiting after 60 seconds) the workers and the scheduler just stick around forever.
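One way around the naming/resources complaint is to give the scheduler its own pod template with a distinct name prefix and lighter resource requests. Whether your dask-kubernetes version accepts a separate scheduler template depends on the release, so the sketch below only builds the plain Kubernetes manifest dicts (the dict structure itself is standard Kubernetes); the image name and resource figures are illustrative assumptions, not values from the issue.

```python
# Hedged sketch: build separate worker and scheduler pod manifests so the
# scheduler is identifiable by name and doesn't inherit the workers' large
# resource requests. The values below are illustrative assumptions.

def make_pod_manifest(name_prefix, cpu_request, memory_request):
    """Build a minimal pod manifest with explicit resource requests."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"generateName": f"{name_prefix}-"},
        "spec": {
            "containers": [
                {
                    "name": name_prefix,
                    "image": "daskdev/dask:latest",  # assumed image
                    "resources": {
                        "requests": {"cpu": cpu_request, "memory": memory_request},
                        "limits": {"cpu": cpu_request, "memory": memory_request},
                    },
                }
            ],
            "restartPolicy": "Never",
        },
    }

# Workers get the heavy requests; the scheduler gets a lighter, clearly named pod.
worker_template = make_pod_manifest("dask-worker", "4", "16G")
scheduler_template = make_pod_manifest("dask-scheduler", "1", "2G")
```

If your dask-kubernetes release supports passing a dedicated scheduler template to `KubeCluster`, these two dicts could be handed to it; otherwise the worker template alone still makes the resource requests explicit.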
The bigger issue: I can't get it working at all. There are pickle errors when both the worker and the client try to connect to the scheduler (`distributed.protocol.pickle - INFO - Failed to deserialize`), although they seem to be masked by a timeout error.
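Deserialization failures like this between client, scheduler, and workers are often caused by mismatched package versions (e.g. a different pickle/cloudpickle or distributed version in the pods than on the client). distributed's `Client.get_versions(check=True)` performs such a check against a live cluster; below is a simplified, stand-alone sketch of the same comparison, with made-up version numbers purely for illustration.

```python
# Hedged sketch: compare package versions reported by each role and flag
# disagreements, similar in spirit to Client.get_versions(check=True).

def find_version_mismatches(client, scheduler, workers):
    """Return {package: {role: version}} for packages that disagree.

    `client` and `scheduler` map package name -> version string;
    `workers` maps worker address -> such a dict.
    """
    roles = {"client": client, "scheduler": scheduler, **workers}
    packages = set().union(*(v.keys() for v in roles.values()))
    mismatches = {}
    for pkg in sorted(packages):
        seen = {role: versions.get(pkg) for role, versions in roles.items()}
        if len(set(seen.values())) > 1:
            mismatches[pkg] = seen
    return mismatches

# Illustrative (made-up) versions: the client has an older cloudpickle
# than the scheduler and worker pods.
client_v = {"distributed": "2.9.0", "cloudpickle": "1.2.2"}
scheduler_v = {"distributed": "2.9.0", "cloudpickle": "1.3.0"}
worker_v = {"tcp://10.0.0.5:34567": {"distributed": "2.9.0", "cloudpickle": "1.3.0"}}

print(find_version_mismatches(client_v, scheduler_v, worker_v))
# -> only "cloudpickle" is reported, since "distributed" agrees everywhere
```

If the check turns up a disagreement, pinning the same image/package versions on the client and in the pod spec is the usual fix.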
Is legacy mode going to disappear in the long run (the name suggests it), or is it safe to keep using it?
Issue Analytics
- Created: 2 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
This is the default from the docs (besides the large cpu/mem requests):
Could you share your `spec.yml` too? I'd like to try and reproduce this locally.