Have workers query service for scheduler address
Currently a worker needs a scheduler address in order to start. This can cause issues in two situations:
- If the scheduler and workers are started at the same time then it can take twice as long to start a meaningful dask cluster as it takes to start a process/vm/pod/… (see https://github.com/dask/distributed/pull/4710#issuecomment-822374645)
- If the scheduler goes down and comes up someplace else then the workers need to be redirected
One exception to this situation is scheduler_file, which uses a file system as a coordination point between a not-yet-created scheduler and a pool of not-yet-created workers. They all check that file periodically, and once the scheduler arrives and writes to it, everyone knows where to connect.
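For reference, that existing pattern needs nothing beyond a shared filesystem. A minimal sketch, assuming a shared path like /shared/scheduler.json (the path is just an example):

```python
# Existing scheduler_file pattern: the scheduler writes its address to a JSON
# file on a shared filesystem, and workers/clients poll that file until it
# appears, so nobody needs the scheduler address up front.
#
#   dask-scheduler --scheduler-file /shared/scheduler.json
#   dask-worker    --scheduler-file /shared/scheduler.json
#
# The same file can then be handed to a client:
from dask.distributed import Client

client = Client(scheduler_file="/shared/scheduler.json")
```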
We might consider something similar with a web service, where the workers probe a service to find out where they should connect:
dask-worker --scheduler-service https://where-is-my-scheduler.com/cluster-id/1234
This would require some sort of basic protocol to be established (probably simpler than Dask comms, maybe a normal web request/response). It would also require us to modify the current logic in the Worker and Nanny for reconnecting to the scheduler. I imagine that this conversation would probably look like the following (a rough worker-side sketch follows the list):
- Worker/Nanny: Hey my-service.com, where is scheduler 1234?
- my-service.com: I see that that cluster should exist, but I don’t yet have an address for you
- Worker/Nanny: Hey my-service.com, where is scheduler 1234?
- my-service.com: I see that that cluster should exist, but I don’t yet have an address for you
- Worker/Nanny: Hey my-service.com, where is scheduler 1234?
- my-service.com: Scheduler 1234 is at tls://…
- Worker/Nanny: Hey tls://…, this is worker 933 checking in …
- Worker/Nanny: Hey my-service.com, scheduler 1234 at tls://… seems to have gone away and I can’t reconnect. Are they coming back somewhere?
- my-service.com: Yes, scheduler 1234 is now at tls://…
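To make the required protocol concrete, here is a hedged sketch of the worker-side half of that exchange. The service URL, cluster id, and JSON response shape are all assumptions for illustration; none of this is an existing distributed API.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_scheduler(service_url: str, cluster_id: str, interval: float = 1.0) -> str:
    """Poll a hypothetical discovery service until it reports a scheduler address."""
    endpoint = f"{service_url}/cluster-id/{cluster_id}"
    while True:
        try:
            with urllib.request.urlopen(endpoint) as response:
                payload = json.loads(response.read())
            # "Scheduler 1234 is at tls://..." -- we are done
            if payload.get("address"):
                return payload["address"]
        except urllib.error.HTTPError as exc:
            # "I see that that cluster should exist, but I don't yet have an address"
            if exc.code != 404:
                raise
        # No address yet; ask again in a little while.
        time.sleep(interval)


# address = wait_for_scheduler("https://where-is-my-scheduler.com", "1234")
# The Worker/Nanny would then connect to `address`, and call this again if the
# scheduler later goes away and it cannot reconnect.
```

On the service side, the same two routes (register a scheduler address, answer lookups for it) would cover both the startup and the reconnection cases above.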
This would probably be useful for systems like Dask-Gateway and certainly Coiled. I’m curious if it could be made useful for other systems like dask-cloudprovider. cc @jacobtomlinson @selshowk
This also looks like a reinvention of ZooKeeper, I think.
Top GitHub Comments
Please don’t XD
My concerns here are:
For users outside of a managed platform like Coiled, I imagine this would be as complex as running whatever bespoke solution we put together.
I expect for most users something like
docker run -d -p 2181:2181 zookeeper
would be sufficient. But even a high-availability cluster with three replicas looks pretty straightforward.

The scheduler file is great because it is simple. My concern here is that building (and more importantly deploying) a bespoke service discovery crosses the line in terms of simplicity. Even if the API is trivial, there is a lot more complexity here than a file on a shared filesystem.
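For comparison, this is roughly what the same coordination could look like on top of ZooKeeper with the kazoo client. The znode path and addresses are made up for illustration, and nothing here is an existing Dask feature:

```python
import time

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # e.g. the container started above
zk.start()

# Scheduler side: publish the address under an agreed-upon znode.
zk.ensure_path("/dask/cluster-1234")
zk.create("/dask/cluster-1234/scheduler", b"tls://10.0.0.5:8786", ephemeral=True)

# Worker side: wait for the znode to appear, then read the address.
while not zk.exists("/dask/cluster-1234/scheduler"):
    time.sleep(1)
address, _ = zk.get("/dask/cluster-1234/scheduler")
print(address.decode())  # tls://10.0.0.5:8786
```

Because the node is ephemeral it disappears when the scheduler's session ends, which also covers the "scheduler went away, is it coming back?" part of the conversation.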