Scheduler behaves badly when adaptively adding workers to meet resource demand
There is an issue with how the scheduler assigns tasks from the `unrunnable` queue when workers that meet the resource requirements join the scheduler.
The use case is a long-running, complex computation where some tasks require an expensive resource, say GPUs, but those resources are only provisioned (through `Adaptive`) once the tasks requiring them are ready to run. Say we come to a point in the computation where 5 tasks could be run if GPUs were available. Managing the `scale_up` behaviour through adaptive is fairly straightforward and allows adding new compute nodes (for instance on AWS) with the required resources. The problem appears when the first of the new workers connects.
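As a concrete sketch of that setup (the `GPUCluster` deployment class here is hypothetical; any backend whose `scale_up` starts workers with `--resources "GPU=1"` fits):

```python
from dask.distributed import Client

# Hypothetical deployment class, standing in for whatever backend (e.g.
# AWS) actually boots GPU nodes and starts workers with:
#   dask-worker <scheduler> --resources "GPU=1"
from my_deployment import GPUCluster


def preprocess(x):
    return x + 1


def train(x):  # stand-in for the expensive GPU-bound work
    return x * 2


cluster = GPUCluster()
cluster.adapt(minimum=0, maximum=10)  # let Adaptive provision on demand
client = Client(cluster)

cheap = client.map(preprocess, range(5))
# These 5 tasks need a GPU each; they sit in the scheduler's `unrunnable`
# set until a worker advertising a GPU resource connects, while Adaptive
# notices the demand and starts booting GPU nodes.
expensive = [client.submit(train, c, resources={"GPU": 1}) for c in cheap]
```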
`Scheduler.add_worker` will go through the list of `unrunnable` tasks and check if there are workers that meet the requirements. Since only the first worker has connected so far, there is only one worker that meets the requirements (some possibly positive number of additional workers may be booting up and joining shortly, but that hasn't happened yet):
```python
# In Scheduler.add_worker: every unrunnable task is re-checked against
# the newly connected worker `ws`.
for ts in list(self.unrunnable):
    valid = self.valid_workers(ts)
    if valid is True or ws in valid:
        recommendations[ts.key] = 'waiting'
```
The task goes through `released -> waiting` and then `waiting -> processing`; `transition_waiting_processing` again calls `valid_workers` to get the set of workers where the task(s) can be run (this set still contains just the single worker, because the other ones haven't yet connected).
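The effect of that selection step, paraphrased (an illustrative sketch, not the actual scheduler code; `pick_worker` is a made-up name):

```python
def pick_worker(ts, connected_workers):
    # Only workers that are currently connected and satisfy the task's
    # resource restrictions are candidates.
    candidates = [
        ws for ws in connected_workers
        if all(ws.resources.get(name, 0) >= amount
               for name, amount in ts.resource_restrictions.items())
    ]
    if not candidates:
        return None  # task stays in (or returns to) unrunnable
    # While only one GPU worker has connected, `candidates` has length 1,
    # so every ready GPU task is assigned to that same worker.
    return min(candidates, key=lambda ws: len(ws.processing))
```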
The end result of all of this is that the worker that happens to connect first and has the resource required by the tasks gets all of the tasks dumped onto it, while all the other workers, which potentially connect just seconds later, get nothing and are shut down by the scheduler because they are idling.
In short, it appears that the purpose of the `resource_requirements` is to act as a hint about the required peak capability (memory, GPU, whatever) of a worker, and not to be a dynamically changing resource allocation. Is this the case, and is there any interest in changing that? The resources available and the resources consumed are taken into account in `transition_waiting_processing`, but only on the `worker_state`, not for the scheduler in general.
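For illustration, that per-worker bookkeeping amounts to roughly the following (paraphrased; attribute names approximate the scheduler's `WorkerState`):

```python
def consume_resources(ts, ws):
    # Runs when a task starts processing. The accounting lives entirely on
    # the individual WorkerState: there is no cluster-wide ledger the
    # scheduler could use to hold tasks back for workers still booting.
    for resource, required in ts.resource_restrictions.items():
        ws.used_resources[resource] += required
```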
If this is not the intended behaviour and should be fixed, I’m more than happy to work on this.
Top GitHub Comments
I'm seeing similar behavior with my workload, where all tasks go to a single worker because the other workers haven't come online yet. This thread is more than a year old, so I'm not sure what has changed since it was opened. Anyway, here are my findings w.r.t. work stealing not working properly with resources.
From what I can tell, stealing logic is implemented in `stealing.py` with a `SchedulerPlugin`. Every time a worker is added, state is updated to keep track of what tasks can be stolen from it; in particular, this is the `stealable` instance attribute. Every time a task transitions, this plugin checks to see whether the task is transitioning to `"processing"` and calls `put_key_in_stealable`, passing in the task state.

Inside of `put_key_in_stealable`, the cost of moving the task is computed using `steal_time_ratio`. The first check in `steal_time_ratio` is whether the task has hard restrictions and whether any of those restrictions are set:
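The check in question looks roughly like this (paraphrased; exact details differ between versions):

```python
def steal_time_ratio(self, ts):
    # Any hard restriction (hosts, workers, or resources) marks the task
    # as unstealable outright.
    if not ts.loose_restrictions and (
        ts.host_restrictions
        or ts.worker_restrictions
        or ts.resource_restrictions
    ):
        return None, None  # don't steal
    ...
```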
Back in `put_key_in_stealable`, nothing happens if the returned cost is `None`:
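Again paraphrased:

```python
def put_key_in_stealable(self, ts):
    cost_multiplier, level = self.steal_time_ratio(ts)
    # Restricted tasks returned None above, so they are never registered
    # in the stealable lists at all.
    if cost_multiplier is not None:
        ws = ts.processing_on
        self.stealable[ws.address][level].add(ts)
        self.key_stealable[ts] = (ws.address, level)
```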
This would seem to explain why no work stealing occurs when tasks are marked with resources. Any comments on the analysis here?
Moving on to a possible fix, I think it makes sense to remove the restrictions check mentioned above and add some checks before the call to `maybe_move_task` in `balance` to ensure that the thief has the resources required to steal the task (see the sketch below). Thoughts?
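Something along these lines, as a hypothetical sketch (the helper name `thief_has_resources` is mine, and the `maybe_move_task` argument names are approximate):

```python
def thief_has_resources(ts, thief):
    # A candidate thief qualifies only if its spare resources cover the
    # task's restrictions on top of what it is already running.
    return all(
        thief.resources.get(name, 0) - thief.used_resources.get(name, 0)
        >= amount
        for name, amount in ts.resource_restrictions.items()
    )

# ...and in balance(), guard each steal attempt:
#     if thief_has_resources(ts, thief):
#         maybe_move_task(level, ts, victim, thief, duration, cost_multiplier)
```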
Sorry, I missed the `.scale` in @leej3's post, so I guess this is the same problem as `.adapt`. Here is a slightly simpler snippet to reproduce the problem:
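A minimal sketch of such a reproduction (assumes a bare scheduler already running at the address below, with workers started by hand to mimic staggered adaptive scale-up):

```python
import time

from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # dask-scheduler with no workers yet


def work(x):
    time.sleep(1)
    return x


# Ten tasks, each requiring one GPU resource: all land in `unrunnable`.
futures = client.map(work, range(10), resources={"GPU": 1})

# Now start two workers a few seconds apart, e.g. from two shells:
#   dask-worker tcp://127.0.0.1:8786 --resources "GPU=1"
#   ...wait a few seconds...
#   dask-worker tcp://127.0.0.1:8786 --resources "GPU=1"
#
# Observed: all ten tasks are assigned to the first worker the moment it
# connects; the second worker receives nothing and sits idle.
print(client.gather(futures))
```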
Summary of the issue: tasks with resource restrictions are all assigned as soon as the first qualifying worker connects, and because restricted tasks are excluded from work stealing, they are never rebalanced onto the workers that join moments later.