
Have workers query service for scheduler address


Currently a worker needs a scheduler address in order to start. This can cause issues in two situations:

  1. If the scheduler and workers are started at the same time, then it can take twice as long to start a meaningful Dask cluster as it takes to start a process/vm/pod/… (see https://github.com/dask/distributed/pull/4710#issuecomment-822374645)
  2. If the scheduler goes down and comes up someplace else, then the workers need to be redirected

One exception to this situation is the scheduler_file option, which uses a shared file system as a coordination point between a not-yet-created scheduler and a pool of not-yet-created workers. They all check that file periodically, and once the scheduler arrives and writes to it, everyone knows where to connect.
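
For concreteness, here is a minimal sketch of that coordination pattern, not the actual implementation in distributed; the "address" field name matches what dask-scheduler writes to the file, but treat the details (polling interval, error handling) as assumptions:

```python
# Minimal sketch of the scheduler_file coordination pattern.
# The scheduler writes its contact details to a JSON file on a shared
# filesystem; each worker polls that file until an address appears.
import json
import time
from pathlib import Path


def wait_for_scheduler(scheduler_file: str, poll_interval: float = 1.0) -> str:
    """Poll the shared file until the scheduler has written its address."""
    path = Path(scheduler_file)
    while True:
        if path.exists():
            try:
                info = json.loads(path.read_text())
                # dask-scheduler writes a JSON blob that includes an "address" key
                return info["address"]
            except (json.JSONDecodeError, KeyError):
                pass  # the scheduler may still be mid-write
        time.sleep(poll_interval)


# A worker process would then do roughly:
#   address = wait_for_scheduler("/shared/fs/scheduler.json")
# and connect to `address` as usual.
```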

We might consider something similar with a web service, where the workers probe a service to find out where they should connect:

```
dask-worker --scheduler-service https://where-is-my-scheduler.com/cluster-id/1234
```

This would require some sort of basic protocol to be established (probably simpler than Dask comms; maybe a normal web request/response). It would also require us to modify the current logic in the Worker and Nanny for reconnecting to the scheduler. I imagine the conversation would look something like the following (a rough sketch of the worker-side loop follows the exchange):

  • Worker/Nanny: Hey my-service.com, where is scheduler 1234?
  • my-service.com: I see that that cluster should exist, but I don’t yet have an address for you
  • Worker/Nanny: Hey my-service.com, where is scheduler 1234?
  • my-service.com: I see that that cluster should exist, but I don’t yet have an address for you
  • Worker/Nanny: Hey my-service.com, where is scheduler 1234?
  • my-service.com: Scheduler 1234 is at tls://…
  • Worker/Nanny: Hey tls://…, this is worker 933 checking in …
  • Worker/Nanny: Hey my-service.com, scheduler 1234 at tls://… seems to have gone away and I can’t reconnect. Are they coming back somewhere?
  • my-service.com: Yes, scheduler 1234 is now at tls://…
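
As a sketch of what the worker-side half of that exchange could look like: the endpoint, the JSON response shape ({"address": "tls://…"}), and the retry policy below are all hypothetical, since no such API exists in distributed today.

```python
# Hypothetical worker-side lookup loop for a scheduler-discovery service.
# Assumes the service returns {"address": "tls://..."} once the scheduler is
# up, and an error (or an empty address) while the cluster is still pending.
import json
import time
import urllib.error
import urllib.request


def resolve_scheduler(service_url: str, poll_interval: float = 2.0) -> str:
    """Ask the discovery service where the scheduler is, retrying until it knows."""
    while True:
        try:
            with urllib.request.urlopen(service_url) as resp:
                payload = json.loads(resp.read().decode())
                if payload.get("address"):
                    return payload["address"]
        except urllib.error.URLError:
            pass  # service unreachable, or the cluster is not registered yet
        time.sleep(poll_interval)


# e.g. resolve_scheduler("https://where-is-my-scheduler.com/cluster-id/1234")
# On disconnect, the Worker/Nanny reconnection logic would call this again
# instead of giving up, covering the "scheduler moved" case above.
```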

This would probably be useful for systems like Dask-Gateway and certainly Coiled. I’m curious if it could be made useful for other systems like dask-cloudprovider. cc @jacobtomlinson @selshowk

This also looks like a reinvention of ZooKeeper, I think.


Top GitHub Comments

1 reaction
fjetter commented, Apr 22, 2021

Unless we fully implement some RAFT system

Please don’t XD

0 reactions
jacobtomlinson commented, Jul 2, 2021

What if it was a very very simple API?

My concerns here are:

  • Where does this code live? In distributed or a side project?
  • Who maintains it? Coiled?
  • How do users deploy it (outside of Coiled)?
  • How do you authenticate users?
  • How do you generate SSL certs for https (Let’s Encrypt?)?
  • Should it support high availability?
  • How does data persist? Is it in memory or is it backed by some data store?

Standing up something like ZooKeeper seems heavyweight to me here.

For users outside of a managed platform like Coiled I imagine this would be as complex as running whatever bespoke solution we put together.

I expect for most users something like docker run -d -p 2181:2181 zookeeper would be sufficient. But even a high-availability cluster with three replicas looks pretty straightforward.
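
For comparison, a rough sketch of what the worker-side lookup might look like against ZooKeeper using the kazoo client; the znode path and the registration convention are made up for illustration, nothing like this exists in distributed today:

```python
# Hypothetical ZooKeeper-based lookup of the scheduler address using kazoo.
from kazoo.client import KazooClient


def lookup_scheduler(zk_hosts: str, znode: str = "/dask/scheduler/1234") -> str:
    """Read the scheduler address that the scheduler registered in ZooKeeper."""
    zk = KazooClient(hosts=zk_hosts)
    zk.start()  # blocks until connected
    try:
        data, _stat = zk.get(znode)  # raises NoNodeError if the scheduler has not registered yet
        return data.decode("utf-8")
    finally:
        zk.stop()


# The scheduler side would register itself with something like:
#   zk.create("/dask/scheduler/1234", b"tls://scheduler:8786", makepath=True)
```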

I view this as similar to the scheduler_file= logic. We did something custom here, yes, but it was super-simple and very pragmatic in the end.

Scheduler file is great because it is simple. My concern here is that building (and, more importantly, deploying) a bespoke service-discovery service crosses the line in terms of simplicity. Even if the API is trivial, there is a lot more complexity here than a file on a shared filesystem.
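
To make “a very very simple API” concrete, here is a hypothetical in-memory discovery service built on nothing but the Python standard library. It deliberately ignores every concern in the list above (authentication, TLS, persistence, high availability), which is exactly the point being debated:

```python
# Hypothetical minimal discovery service: GET /<cluster-id> returns the
# scheduler address as JSON; PUT /<cluster-id> lets the scheduler register
# itself. State lives in memory only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SCHEDULERS = {}  # cluster-id -> scheduler address


class DiscoveryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        cluster_id = self.path.strip("/")
        address = SCHEDULERS.get(cluster_id)
        if address is None:
            self.send_response(404)  # cluster unknown or scheduler not yet registered
            self.end_headers()
            return
        body = json.dumps({"address": address}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        # Scheduler registers itself: PUT /<cluster-id> with its address as the body.
        cluster_id = self.path.strip("/")
        length = int(self.headers.get("Content-Length", 0))
        SCHEDULERS[cluster_id] = self.rfile.read(length).decode()
        self.send_response(204)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), DiscoveryHandler).serve_forever()
```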
