`Adaptive` doesn't scale quickly for long-running jobs

I’m using dask to coordinate some long-running machine learning jobs. I’ve set up an adaptive cluster (with dask_jobqueue) that has a minimum of 5 workers and a maximum of 10. Each task I dispatch takes about two hours to run and consistently uses ~100% of the CPU available to it. However, the adaptive cluster doesn’t seem to want to add any more workers. It sits at the minimum number and never increases. Is there some way to modify the scheduling policy so that the cluster scales up more aggressively?
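
For context, here is a minimal sketch of this kind of setup, assuming a SLURM-backed dask_jobqueue cluster; the cluster class, resource sizes, and the train_model task are illustrative, not taken from the original report:

```python
from dask_jobqueue import SLURMCluster  # any dask_jobqueue cluster class works similarly
from dask.distributed import Client

# Each job/worker gets the resources one long-running task needs (values are illustrative).
cluster = SLURMCluster(cores=4, memory="16GB", walltime="04:00:00")

# Adaptive scaling between 5 and 10 workers, as described in the issue.
cluster.adapt(minimum=5, maximum=10)

client = Client(cluster)
# futures = client.map(train_model, configs)  # hypothetical long-running tasks, ~2h each
```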

I’m aware this isn’t exactly the sort of job dask is designed to schedule; my impression is that it prefers smaller, faster tasks. It seems like it might be possible to modify Adaptive to use a policy better suited to long-running jobs, but I spent some time digging into the source and got lost. Any pointers would be helpful 😃

My current workaround is ignoring Adaptive and scaling the cluster by hand. I feel bad for taking up nodes longer than I need though.
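
Continuing the sketch above, the manual workaround amounts to something like this (the worker counts are illustrative):

```python
# Bypass Adaptive and set the cluster size by hand.
cluster.scale(10)   # request 10 workers before submitting work

# ... submit the long-running tasks and wait for them ...

cluster.scale(5)    # release nodes once the batch is done
```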

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
fjetter commented, Mar 24, 2020

The decision to scale up or down is currently coupled to the measurement of a task’s runtime (similar tasks are grouped; see TaskPrefix). As long as the runtime hasn’t been measured at least once, the cluster cannot estimate how long the entire computation graph might take, and it will not scale up (if you want to go down the rabbit hole, it boils down to Scheduler.total_occupancy). You should see the cluster scale up once the first job finishes.

We’re facing the same issue, see #3516

A workaround for this is to configure default task durations, which are used as long as no measurements are available, e.g.

distributed:
  scheduler:
    default-task-durations:
      my-function: 2h  # This should be the same name you see on the dashboard
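
The same default can also be set programmatically before the cluster and client are created, e.g. via dask.config.set. This is a sketch assuming the dotted form of the config key shown in the YAML above; my-function is a placeholder for the task-prefix name shown on the dashboard:

```python
import dask

# Set before creating the cluster/scheduler so the value is picked up;
# "my-function" is a placeholder for the task-prefix name on the dashboard.
dask.config.set({
    "distributed.scheduler.default-task-durations": {"my-function": "2h"}
})
```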

0 reactions
chrisroat commented, Feb 7, 2021

Is there a correct incantation of this that would rectify the problem in #4471?
