Scheduler behaves badly when adaptively adding workers to meet resource demand
There is an issue with how the scheduler assigns tasks from the `unrunnable` queue when workers that meet the resource requirements join the scheduler.
The use case is a long-running, complex computation where some tasks require an expensive resource, say GPUs, but those resources are only provisioned (through `Adaptive`) once the tasks requiring them are ready to run. Say we come to a point in the computation where 5 tasks could be run if GPUs were available. Managing the `scale_up` behaviour through adaptive is fairly straightforward and allows adding new compute nodes (for instance on AWS) with the required resources. The problem appears when the first of the new workers connects.
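As a concrete sketch of that setup (the `GPUCluster` deployment class here is hypothetical; any backend whose `scale_up` starts workers with `--resources "GPU=1"` fits):

```python
from dask.distributed import Client

# Hypothetical deployment class, standing in for whatever backend (e.g.
# AWS) actually boots GPU nodes and starts workers with:
#   dask-worker <scheduler> --resources "GPU=1"
from my_deployment import GPUCluster


def preprocess(x):
    return x + 1


def train(x):  # stand-in for the expensive GPU-bound work
    return x * 2


cluster = GPUCluster()
cluster.adapt(minimum=0, maximum=10)  # let Adaptive provision on demand
client = Client(cluster)

cheap = client.map(preprocess, range(5))
# These 5 tasks need a GPU each; they sit in the scheduler's `unrunnable`
# set until a worker advertising a GPU resource connects, while Adaptive
# notices the demand and starts booting GPU nodes.
expensive = [client.submit(train, c, resources={"GPU": 1}) for c in cheap]
```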
`Scheduler.add_worker` will go through the list of `unrunnable` tasks and check if there are workers that meet the requirements. Since only the first worker has connected so far, there is only one worker that meets the requirements (some possibly positive number of additional workers may be booting up and joining shortly, but that hasn't happened yet):
```python
# In Scheduler.add_worker: every unrunnable task is re-checked against
# the newly connected worker `ws`.
for ts in list(self.unrunnable):
    valid = self.valid_workers(ts)
    if valid is True or ws in valid:
        recommendations[ts.key] = 'waiting'
```
The task goes through `released -> waiting` and then `waiting -> processing`; `transition_waiting_processing` again calls `valid_workers` to get the set of workers where the task(s) can be run (this set still contains just the single worker, because the other ones haven't yet connected).
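The effect of that selection step, paraphrased (an illustrative sketch, not the actual scheduler code; `pick_worker` is a made-up name):

```python
def pick_worker(ts, connected_workers):
    # Only workers that are currently connected and satisfy the task's
    # resource restrictions are candidates.
    candidates = [
        ws for ws in connected_workers
        if all(ws.resources.get(name, 0) >= amount
               for name, amount in ts.resource_restrictions.items())
    ]
    if not candidates:
        return None  # task stays in (or returns to) unrunnable
    # While only one GPU worker has connected, `candidates` has length 1,
    # so every ready GPU task is assigned to that same worker.
    return min(candidates, key=lambda ws: len(ws.processing))
```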
The end result of all of this is that the worker that happens to connect first and has the resource required by the tasks gets all of the tasks dumped onto it, while all the other workers, which potentially connect just seconds later, get nothing and are shut down by the scheduler because they are idling.
In short, it appears that the purpose of the `resource_requirements` is to act as a hint about the required peak capability (memory, GPU, whatever) of a worker, and not to be a dynamically changing resource allocation. Is this the case, and is there any interest in changing that? The resources available and the resources consumed are taken into account in `transition_waiting_processing`, but only on the `worker_state`, not for the scheduler in general.
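For illustration, that per-worker bookkeeping amounts to roughly the following (paraphrased; attribute names approximate the scheduler's `WorkerState`):

```python
def consume_resources(ts, ws):
    # Runs when a task starts processing. The accounting lives entirely on
    # the individual WorkerState: there is no cluster-wide ledger the
    # scheduler could use to hold tasks back for workers still booting.
    for resource, required in ts.resource_restrictions.items():
        ws.used_resources[resource] += required
```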
If this is not the intended behaviour and should be fixed, I’m more than happy to work on this.
Top GitHub Comments
I'm seeing similar behavior with my workload, where all tasks go to a single worker because the other workers haven't come online yet. This thread is more than a year old, so I'm not sure what has changed since it was opened. Anyway, here are my findings w.r.t. work stealing not working properly with resources.
From what I can tell, stealing logic is implemented in `stealing.py` with a `SchedulerPlugin`. Every time a worker is added, state is updated to keep track of what tasks can be stolen from it; in particular, this is the `stealable` instance attribute. Every time a task transitions, this plugin checks to see whether the task is transitioning to `"processing"` and calls `put_key_in_stealable`, passing in the task state.

Inside of `put_key_in_stealable`, the cost of moving the task is computed using `steal_time_ratio`. The first check in `steal_time_ratio` is whether the task has hard restrictions and whether any of those restrictions are set:
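The check in question looks roughly like this (paraphrased; exact details differ between versions):

```python
def steal_time_ratio(self, ts):
    # Any hard restriction (hosts, workers, or resources) marks the task
    # as unstealable outright.
    if not ts.loose_restrictions and (
        ts.host_restrictions
        or ts.worker_restrictions
        or ts.resource_restrictions
    ):
        return None, None  # don't steal
    ...
```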
Back in `put_key_in_stealable`, nothing happens if the returned cost is `None`:
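Again paraphrased:

```python
def put_key_in_stealable(self, ts):
    cost_multiplier, level = self.steal_time_ratio(ts)
    # Restricted tasks returned None above, so they are never registered
    # in the stealable lists at all.
    if cost_multiplier is not None:
        ws = ts.processing_on
        self.stealable[ws.address][level].add(ts)
        self.key_stealable[ts] = (ws.address, level)
```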
This would seem to explain why no work stealing occurs when tasks are marked with resources. Any comments on the analysis here?
Moving on to a possible fix, I think it makes sense to remove the restrictions check mentioned above and add some checks before the call to `maybe_move_task` in `balance` to ensure that the thief has the resources required to steal the task (see the sketch below). Thoughts?
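Something along these lines, as a hypothetical sketch (the helper name `thief_has_resources` is mine, and the `maybe_move_task` argument names are approximate):

```python
def thief_has_resources(ts, thief):
    # A candidate thief qualifies only if its spare resources cover the
    # task's restrictions on top of what it is already running.
    return all(
        thief.resources.get(name, 0) - thief.used_resources.get(name, 0)
        >= amount
        for name, amount in ts.resource_restrictions.items()
    )

# ...and in balance(), guard each steal attempt:
#     if thief_has_resources(ts, thief):
#         maybe_move_task(level, ts, victim, thief, duration, cost_multiplier)
```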
Sorry, I missed the `.scale` in @leej3's post, so I guess this is the same problem as `.adapt`. Here is a slightly simpler snippet to reproduce the problem:
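A minimal sketch of such a reproduction (assumes a bare scheduler already running at the address below, with workers started by hand to mimic staggered adaptive scale-up):

```python
import time

from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # dask-scheduler with no workers yet


def work(x):
    time.sleep(1)
    return x


# Ten tasks, each requiring one GPU resource: all land in `unrunnable`.
futures = client.map(work, range(10), resources={"GPU": 1})

# Now start two workers a few seconds apart, e.g. from two shells:
#   dask-worker tcp://127.0.0.1:8786 --resources "GPU=1"
#   ...wait a few seconds...
#   dask-worker tcp://127.0.0.1:8786 --resources "GPU=1"
#
# Observed: all ten tasks are assigned to the first worker the moment it
# connects; the second worker receives nothing and sits idle.
print(client.gather(futures))
```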
Summary of the issue: tasks with resource restrictions are all assigned as soon as the first qualifying worker connects, and because restricted tasks are excluded from work stealing, they are never rebalanced onto the workers that join moments later.