`Adaptive` doesn't scale quickly for long-running jobs
I’m using dask to coordinate some long-running machine learning jobs. I’ve set up an adaptive cluster (with dask_jobqueue) that has a minimum of 5 workers and a maximum of 10. Each task I dispatch takes about two hours to run and consistently uses ~100% of the CPU available to it. However, the adaptive cluster doesn’t seem to want to add any more workers. It sits at the minimum number and never increases. Is there some way to modify the scheduling policy so that the cluster scales up more aggressively?
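For context, a minimal sketch of the kind of setup described, assuming a SLURM-backed dask_jobqueue cluster (the cluster class and resource arguments are illustrative, not from the original post):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Illustrative resources; the issue does not specify the batch system or sizes.
cluster = SLURMCluster(cores=8, memory="16GB", walltime="04:00:00")

# Adaptive scaling between 5 and 10 workers, as described in the issue.
cluster.adapt(minimum=5, maximum=10)

client = Client(cluster)
```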
I’m aware this isn’t exactly the sort of job dask is designed to schedule – my impression is that it prefers smaller, faster tasks. I think you might be able to modify Adaptive to use a different policy that’s better suited for long-running jobs? But I spent some time digging into the source and got kinda lost. Any pointers would be helpful 😃
My current workaround is ignoring Adaptive and scaling the cluster by hand. I feel bad for taking up nodes longer than I need though.
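For reference, manual scaling on a dask_jobqueue cluster looks roughly like this (a sketch, not the poster’s exact code):

```python
# Bypass Adaptive and request a fixed number of workers by hand.
cluster.scale(10)          # or cluster.scale(jobs=10) to count batch jobs instead of workers
```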
Top GitHub Comments
The decision to scale up or down is currently coupled to the measurement of a task’s runtime (similar tasks are grouped, see TaskPrefix). As long as the runtime hasn’t been measured at least once, the cluster cannot estimate how long the entire computation graph might take and it will not scale up (see here; it boils down to Scheduler.total_occupancy if you want to go down the rabbit hole). You should see the cluster scale up once the first job finishes.
We’re facing the same issue, see #3516
A workaround for this is to configure default task durations, which are used as long as there are no measurements available, e.g.
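A sketch of such a configuration, assuming the distributed scheduler config keys default-task-durations and unknown-task-duration, with a hypothetical task prefix train_model standing in for your own task names:

```python
import dask

# Assume unmeasured tasks whose key prefix is "train_model" take ~2 hours,
# so the scheduler's occupancy estimate is non-trivial before the first
# task finishes and Adaptive has something to scale on.
dask.config.set({
    "distributed.scheduler.default-task-durations": {"train_model": "2h"},
    # Fallback estimate for task prefixes without an explicit entry.
    "distributed.scheduler.unknown-task-duration": "1h",
})
```

These settings are read by the scheduler, so they should be in place before the cluster is created (or put into the distributed YAML configuration) rather than set afterwards on the client.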
Is there a correct incantation of this that would rectify the problem in #4471?