
`decide_worker` could be expensive on large clusters with queuing enabled

See original GitHub issue

When queuing is enabled, we currently pick the worker for a queued task by finding the idle worker with the fewest tasks.

This means a linear search over the set of idle workers, which could, at worst, contain all workers.

So on really large clusters (thousands to tens of thousands of workers), this could get expensive, because `decide_worker` is called for every task.
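For concreteness, here is a minimal sketch of what that scan amounts to; the names `idle` and `ws.processing` mirror scheduler state, but this is illustrative rather than the actual implementation:

```python
def pick_least_busy_idle_worker(idle):
    """Illustrative sketch: find the idle worker with the fewest tasks.

    This is an O(len(idle)) scan, and `idle` can contain every worker
    on the cluster near the start and end of a computation.
    """
    if not idle:
        return None
    return min(idle, key=lambda ws: len(ws.processing))
```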

Previously, idle was a SortedSet, and when there were >20 workers (arbitrary cutoff), we’d just pick one round-robin style, which cost O(log n): https://github.com/dask/distributed/blob/02b9430f7bb6eae370394eaac6b4e8f574746c94/distributed/scheduler.py#L2204

We don’t have data showing performance is a problem here, but we also haven’t benchmarked a really large cluster yet. I don’t want to prematurely optimize, but given that we intend to turn queuing on by default (https://github.com/dask/distributed/issues/7213), it would also be bad if the default were slow for large-cluster users.

Do we want to preemptively change this logic? Options I could imagine (simple->complex):

  1. Do nothing. With 10k workers, there are probably plenty of other things that are more inefficient than decide_worker that we should improve first. Plus, idle will only be large at the beginning and end of the computation; most of the time it should be quite small.

  2. If len(idle) > some arbitrary cutoff (maybe 20 again), just pick next(iter(self.idle)). (I’d like to make idle no longer sorted since it’s rather expensive and we’re only sorting by name, not something useful https://github.com/dask/distributed/pull/7245.)

    We could do something simple with CPython set iteration order (or use a dict[WorkerState, None]) to make this properly round-robin; see the sketch after this list.

  3. Maintain a structure binning idle workers by the number of tasks processing (also sketched below). This assumes worker thread counts are relatively small (in the thousands at most). We could find the least-occupied worker in O(1), and updating when tasks are added/removed would be O(1) as well. (Could also use a heap, but then the update would be O(log n). Taking a bucket-sort approach by assuming thread counts are small seems smarter.)
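
A rough sketch of the option-2 idea, using dict insertion order as an ordered set (`RoundRobinIdle` is a hypothetical name, not scheduler API):

```python
class RoundRobinIdle:
    """Sketch of option 2: O(1) round-robin selection over idle workers.

    A plain dict keyed by WorkerState acts as an insertion-ordered set
    (dict[WorkerState, None]); popping the oldest entry and re-inserting
    it gives round-robin behavior without any sorting.
    """

    def __init__(self):
        self._idle = {}  # dict[WorkerState, None], insertion-ordered

    def add(self, ws):
        self._idle[ws] = None

    def discard(self, ws):
        self._idle.pop(ws, None)

    def pick(self):
        if not self._idle:
            return None
        ws = next(iter(self._idle))   # oldest idle worker
        del self._idle[ws]
        self._idle[ws] = None         # move it to the back
        return ws
```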

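And a sketch of the option-3 bucketing structure, assuming per-worker task counts stay small (again, hypothetical names, not a concrete proposal for the scheduler's internals):

```python
class OccupancyBuckets:
    """Sketch of option 3: bin idle workers by number of processing tasks.

    add/move are O(1); finding the least-occupied worker scans upward
    from a cached hint, which stays cheap as long as per-worker task
    counts are small (per the thread-count assumption above).
    """

    def __init__(self, max_tasks):
        # buckets[n] holds the idle workers currently processing n tasks
        self.buckets = [set() for _ in range(max_tasks + 1)]
        self._lowest = max_tasks + 1  # hint: lowest possibly-non-empty bucket

    def add(self, ws, n_processing):
        self.buckets[n_processing].add(ws)
        self._lowest = min(self._lowest, n_processing)

    def remove(self, ws, n_processing):
        self.buckets[n_processing].discard(ws)

    def move(self, ws, old_n, new_n):
        """Update when a task is added to or removed from a worker."""
        self.buckets[old_n].discard(ws)
        self.add(ws, new_n)

    def find_least_occupied(self):
        # Scan upward from the cached hint, skipping emptied buckets.
        for n in range(self._lowest, len(self.buckets)):
            if self.buckets[n]:
                self._lowest = n
                return next(iter(self.buckets[n]))
        self._lowest = len(self.buckets)
        return None
```
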
cc @fjetter @crusaderky

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
gjoseph92 commented, Nov 4, 2022

Here is a scheduler profile of a 512-worker cluster running anom_mean (~500k tasks) under https://github.com/dask/distributed/pull/7257:

[speedscope profile screenshot]

(I’ve cropped out the ~7min of idleness waiting for the 700mb graph to upload. That is itself a big issue, but for talking about scheduler performance it makes the percentages you see in speedscope more useful.)

Good news: decide_worker_rootish_queuing_enabled is 0.15% of total time. (And about half of it is in __iter__ over the sorted container, which would get even faster with https://github.com/dask/distributed/pull/7245.) For comparison, decide_worker_non_rootish is 1.6%.

This is so tiny that it doesn’t seem worth thinking about further.

Bad news: the scheduler is pretty much 100% busy, and it looks like only 12% of that is even spent on transitions (the meat of the scheduling work). There’s a lot of other overhead. As usual, I think a lot of it (~50%?) is ‘non-blocking’ IO blocking the event loop, plus tornado and asyncio overhead.

That’s a different discussion, but the point is that decide_worker doesn’t seem to be very important in the scheme of things.

0 reactions
gjoseph92 commented, Nov 10, 2022

Seems like we’re okay with this, closing.
