`decide_worker` could be expensive on large clusters with queuing enabled
When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks. This means a linear search over the set of idle workers, which could, at worst, contain all workers. So on really large clusters (thousands to tens of thousands of workers), this could get expensive, because `decide_worker` is called for every task.
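Roughly, the per-task scan amounts to something like the sketch below (illustrative only, not the actual scheduler code; `idle` stands in for the scheduler's set of idle `WorkerState` objects):

```python
def pick_worker(idle):
    # Linear scan: O(len(idle)) work for every queued task that gets scheduled.
    # `ws.processing` holds the tasks currently assigned to worker `ws`.
    return min(idle, key=lambda ws: len(ws.processing))
```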
Previously, `idle` was a `SortedSet`, and when there were >20 workers (arbitrary cutoff), we'd just pick one round-robin style, which cost O(log n): https://github.com/dask/distributed/blob/02b9430f7bb6eae370394eaac6b4e8f574746c94/distributed/scheduler.py#L2204
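For reference, a sketch of what that old round-robin fallback looked like, assuming a `sortedcontainers.SortedSet`-style container that supports positional indexing in O(log n); the class and names here are illustrative, not the scheduler's actual code:

```python
from sortedcontainers import SortedSet

class RoundRobinPicker:
    """Illustrative only: when many workers are idle, skip the
    "fewest tasks" scan and just cycle through the sorted container.
    Positional indexing into a SortedSet costs O(log n) per pick."""

    def __init__(self, addresses):
        self.idle = SortedSet(addresses)  # sorted by worker address/name
        self._i = 0

    def pick(self):
        ws = self.idle[self._i % len(self.idle)]
        self._i += 1
        return ws

picker = RoundRobinPicker(["tcp://w1:1234", "tcp://w2:1234", "tcp://w3:1234"])
picker.pick()  # cycles through workers in sorted order
```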
We don’t have data showing performance is a problem here, but we also haven’t benchmarked a really large cluster yet. I don’t want to prematurely optimize, but given that we intend to turn queuing on by default https://github.com/dask/distributed/issues/7213, it would also be bad if the default were slow for large-cluster users.
Do we want to preemptively change this logic? Options I could imagine (simple->complex):
- Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than `decide_worker` that we should improve first. Plus, `idle` will only be large at the beginning and end of the computation; most of the time it should be quite small.
- If `len(idle)` > some arbitrary cutoff (maybe 20 again), just pick `next(iter(self.idle))`. (I'd like to make `idle` no longer sorted, since it's rather expensive and we're only sorting by name, not something useful: https://github.com/dask/distributed/pull/7245.) We could do something simple with CPython set iteration order (or use a `dict[WorkerState, None]`) to make this properly round-robin.
- Maintain a structure binning idle workers by the number of tasks processing. This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(log n). Taking a bucket-sort approach by assuming thread counts are small seems smarter.) See the sketch after this list.
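A rough sketch of what that binning structure could look like, assuming per-worker thread counts are small; `OccupancyBins` and its methods are hypothetical names, not anything that exists in distributed today:

```python
# Hypothetical sketch of the bucket-sort idea, not an existing distributed
# data structure. Idle workers are binned by how many tasks they are
# currently processing, so finding a least-occupied worker only ever scans
# the (small) list of bins, and updates on task add/remove are O(1).
class OccupancyBins:
    def __init__(self, max_threads: int):
        # bins[n] holds idle workers with exactly n tasks processing; a dict
        # preserves insertion order, which also gives a round-robin flavor
        # among equally occupied workers.
        self.bins: list[dict[object, None]] = [{} for _ in range(max_threads + 1)]

    def add(self, ws, n_processing: int) -> None:
        self.bins[n_processing][ws] = None

    def discard(self, ws, n_processing: int) -> None:
        self.bins[n_processing].pop(ws, None)

    def update(self, ws, old_n: int, new_n: int) -> None:
        # Called when a task is assigned to, or finishes on, an idle worker: O(1).
        self.discard(ws, old_n)
        self.add(ws, new_n)

    def least_occupied(self):
        # The scan is bounded by the number of bins, i.e. the worker thread
        # count, which this approach assumes is small; tracking the lowest
        # non-empty bin would make this a true O(1) lookup.
        for bucket in self.bins:
            if bucket:
                return next(iter(bucket))
        return None  # no idle workers
```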
Top GitHub Comments
Here is a scheduler profile of a 512-worker cluster running `anom_mean` (~500k tasks) under https://github.com/dask/distributed/pull/7257: profile-215d100-sat-1_0-test_anom_mean-clipped.json

(I've cropped out the ~7 min of idleness waiting for the 700 MB graph to upload. That is itself a big issue, but for talking about scheduler performance it makes the percentages you see in speedscope more useful.)
Good news: `decide_worker_rootish_queuing_enabled` is 0.15% of total time. (And about half of that is in `__iter__` over the sorted container, which would get even faster with https://github.com/dask/distributed/pull/7245.) For comparison, `decide_worker_non_rootish` is 1.6%. This is such a tiny fraction that we should stop thinking about it; it doesn't seem to matter.
Bad news: the scheduler is pretty much 100% busy, and it looks like only 12% of that is even spent on transitions (the meat of the scheduling work). There’s a lot of other overhead. As usual, I think a lot of it (~50%?) is ‘non-blocking’ IO blocking the event loop, plus tornado and asyncio overhead.
That's a different discussion, but the point is that `decide_worker` doesn't seem to be very important in the scheme of things.

Seems like we're okay with this, closing.