`Adaptive` doesn't scale quickly for long-running jobs
I’m using dask to coordinate some long-running machine learning jobs. I’ve set up an adaptive cluster (with dask_jobqueue) that has a minimum of 5 workers and a maximum of 10. Each task I dispatch takes about two hours to run and consistently uses ~100% of the CPU available to it. However, the adaptive cluster doesn’t seem to want to add any more workers. It sits at the minimum number and never increases. Is there some way to modify the scheduling policy so that the cluster scales up more aggressively?
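For context, a minimal sketch of the kind of setup described, assuming a SLURM-backed dask_jobqueue cluster (the cluster class and resource arguments are illustrative, not from the original post):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Illustrative resources; the issue does not specify the batch system or sizes.
cluster = SLURMCluster(cores=8, memory="16GB", walltime="04:00:00")

# Adaptive scaling between 5 and 10 workers, as described in the issue.
cluster.adapt(minimum=5, maximum=10)

client = Client(cluster)
```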
I’m aware this isn’t exactly the sort of job dask is designed to schedule – my impression is that it prefers smaller, faster tasks. I think you might be able to modify Adaptive to use a different policy that’s better suited for long-running jobs? But I spent some time digging into the source and got kinda lost. Any pointers would be helpful 😃
My current workaround is ignoring Adaptive and scaling the cluster by hand. I feel bad for taking up nodes longer than I need though.
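For reference, manual scaling on a dask_jobqueue cluster looks roughly like this (a sketch, not the poster’s exact code):

```python
# Bypass Adaptive and request a fixed number of workers by hand.
cluster.scale(10)          # or cluster.scale(jobs=10) to count batch jobs instead of workers
```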
Top GitHub Comments
The decision to scale up or down is currently coupled to the measurement of a task’s runtime (similar tasks are grouped, see TaskPrefix). As long as the runtime hasn’t been measured at least once, the cluster cannot estimate how long the entire computation graph might take and it will not scale up (see here; it boils down to Scheduler.total_occupancy if you want to go down the rabbit hole). You should see the cluster scale up once the first job finishes.
We’re facing the same issue, see #3516
A workaround for this is to configure default task durations, which are used as long as there are no measurements available, e.g.
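A sketch of such a configuration, assuming the distributed scheduler config keys default-task-durations and unknown-task-duration, with a hypothetical task prefix train_model standing in for your own task names:

```python
import dask

# Assume unmeasured tasks whose key prefix is "train_model" take ~2 hours,
# so the scheduler's occupancy estimate is non-trivial before the first
# task finishes and Adaptive has something to scale on.
dask.config.set({
    "distributed.scheduler.default-task-durations": {"train_model": "2h"},
    # Fallback estimate for task prefixes without an explicit entry.
    "distributed.scheduler.unknown-task-duration": "1h",
})
```

These settings are read by the scheduler, so they should be in place before the cluster is created (or put into the distributed YAML configuration) rather than set afterwards on the client.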
Is there a correct incantation of this that would rectify the problem in #4471?