Canceling long running tasks

See original GitHub issue

Let’s assume you decide to write the following in an interactive shell session (or Jupyter notebook):

>>> from threading import current_thread
>>> from time import sleep
>>> from dask import delayed, compute
>>> def print_sleep_repeat():
...     while True:
...         print('I am %s and I am still alive' % current_thread().ident)
...         sleep(1)
...         
>>> stuff = [delayed(print_sleep_repeat)() for i in range(4)]
>>> compute(*stuff)
I am 140279596939008 and I am still alive
I am 140279613724416 and I am still alive
I am 140279415871232 and I am still alive
I am 140279605331712 and I am still alive
[...]

You soon realise that it was a bad idea to submit those annoying tasks from your shell. You try to Ctrl-C the call to dask.compute, but then you realize that this does not stop the threads from running in the background:

^CTraceback (most recent call last):
  File "<ipython-input-2-c5db33259be8>", line 10, in <module>
    compute(*stuff)
  File "/home/ogrisel/code/dask/dask/base.py", line 110, in compute
    results = get(dsk, keys, **kwargs)
  File "/home/ogrisel/code/dask/dask/threaded.py", line 57, in get
    **kwargs)
  File "/home/ogrisel/code/dask/dask/async.py", line 474, in get_async
    key, res, tb, worker_id = queue.get()
  File "/usr/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/usr/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt
>>> I am 140279613724416 and I am still alive
I am 140279596939008 and I am still alive
I am 140279415871232 and I am still alive
I am 140279605331712 and I am still alive
I am 140279613724416 and I am still alive
I am 140279605331712 and I am still alive
I am 140279596939008 and I am still alive
I am 140279415871232 and I am still alive

I think interrupting the call to dask.compute should try its best to interrupt all the scheduled tasks. Possible solutions:

1. Terminate the whole ThreadPool (although I am not even sure that would solve this issue).
2. Leverage the ctypes.pythonapi.PyThreadState_SetAsyncExc trick: http://stackoverflow.com/questions/323972/is-there-any-way-to-kill-a-thread-in-python
3. Try to use signal.pthread_kill, which should make it possible to also kill long-running compiled extensions that never reach back into the Python interpreter to receive the PyThreadState_SetAsyncExc interruption.

The ctypes.pythonapi.PyThreadState_SetAsyncExc trick is nice because it should not run the risk of deadlocking the Python process by messing with the GIL or other Python runtime state.
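
For reference, here is a minimal sketch of that ctypes trick, adapted from the linked StackOverflow answer rather than taken from dask itself; the exception is only delivered once the target thread re-enters the Python interpreter, so it cannot interrupt a thread stuck inside a compiled extension:

import ctypes


def async_raise(thread_ident, exc_type=KeyboardInterrupt):
    """Ask CPython to raise exc_type inside the thread with the given ident."""
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_long(thread_ident), ctypes.py_object(exc_type)
    )
    if res == 0:
        raise ValueError("invalid thread ident: %r" % thread_ident)
    if res > 1:
        # More than one thread was affected: revert the request to stay safe.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_ident), None)
        raise SystemError("PyThreadState_SetAsyncExc affected more than one thread")

Calling async_raise() with one of the thread idents printed above would make that thread raise KeyboardInterrupt at its next bytecode instruction.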

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 25 (13 by maintainers)

Top GitHub Comments

1 reaction
louisabraham commented on Aug 29, 2019

@zhanghang1989 No, I implemented it without dask.

I don’t think the Coordination Primitives existed when I wrote my code.

In particular, Global Variables seem to be able to address some use cases:

This is often used to signal stopping criteria or current parameters between clients.

Your algorithm could look like this:

from dask.distributed import Variable  # requires a running distributed Client

stop = Variable('stopping-criterion')
while stop.get() is False:
    # do one step of the computation
    ...

If you call other functions (from sklearn, like in TPOT), I would recommend putting them in another process which you can kill when the stop variable is set. Do not forget to use the right parameters to allocate enough cores through your queue system.
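
A rough illustration of that last suggestion (my sketch, not code from the comment), assuming a distributed Client is already connected, that the driver has called Variable('stopping-criterion').set(False) up front, and that workers are allowed to start child processes (see the distributed.worker.daemon setting); blocking_fit is a hypothetical stand-in for the sklearn call:

import time
from multiprocessing import Process

from dask.distributed import Variable, get_client


def blocking_fit():
    # Hypothetical stand-in for a long sklearn call you cannot interrupt from Python.
    time.sleep(3600)


def supervised_task():
    """Run the blocking call in a child process and terminate it once 'stop' is set."""
    stop = Variable('stopping-criterion', client=get_client())
    proc = Process(target=blocking_fit)
    proc.start()
    while proc.is_alive():
        if stop.get() is True:
            proc.terminate()  # SIGTERM works even if the child is stuck in compiled code
            proc.join()
            return None
        time.sleep(1)
    return proc.exitcode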

0 reactions
MichaelSchreier commented on Jun 5, 2020

I’m in a similar situation where tasks may randomly get stuck (either in C-code that I don’t own or during communication with a database) and the only clean option right now is to completely terminate the entire Cluster and rerun everything. Obviously this is far from ideal.

I just hacked together this minimal example that seems to do the job for the “single-thread-per-worker” case. I can already see multiple gotchas and expect to encounter many more should I go forward and implement something like this in a real system. Use at your own risk, you’ve been warned.

The solution works by creating a custom Scheduler which periodically checks for how long each task has been in the processing state and restarts the associated worker should the lifetime of the task exceed a predefined threshold. In my real-world use case I don’t need the return values of the tasks (they save their results to disk) and can deal with failed tasks in the pipeline. Hence I modify the task to return a default value if it has timed out too often.

import dask
import datetime
import time

dask.config.set({"distributed.logging.distributed": "ERROR"})

from dask.distributed import SpecCluster, Scheduler, Nanny, Client, Variable, get_client
from tornado.ioloop import PeriodicCallback


DEFAULT_TIMEOUT = 12
MAX_TIMEOUTS = 1
DEFAULT_RETURN_VALUE = -1


class TimeoutScheduler(Scheduler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

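        # Re-check the running tasks every 1000 ms on the scheduler's event loop.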
        self.periodic_callbacks["update_timeout_log"] = PeriodicCallback(
            self._check_timeout, 1000
        )
        self.timeout_log = dict()

    async def _check_timeout(self):
        """
        Check all tasks currently being processed and verify that none has been running
        longer than DEFAULT_TIMEOUT seconds. If one has, all data currently residing on
        the corresponding worker is fetched and the worker is restarted. If a task times
        out more than MAX_TIMEOUTS times, a 'Variable' is set which causes the task to
        immediately return DEFAULT_RETURN_VALUE the next time it is executed.
        """
        for worker_address, worker_state in self.workers.items():
            for task_state, cost in worker_state.processing.items():
                if task_state.key in self.timeout_log:
                    if self.timeout_log[task_state.key]["start_time"] is None:
                        self.timeout_log[task_state.key][
                            "start_time"
                        ] = datetime.datetime.now()

                    # kill worker
                    if (
                        datetime.datetime.now()
                        - self.timeout_log[task_state.key]["start_time"]
                    ).total_seconds() > DEFAULT_TIMEOUT:
                        self.timeout_log[task_state.key]["start_time"] = None
                        self.timeout_log[task_state.key]["n_timeouts"] += 1

                        # set cancel Variable in case n_timeouts exceeds MAX_TIMEOUTS
                        if self.timeout_log[task_state.key]["n_timeouts"] > MAX_TIMEOUTS:
                            print(f"cancelling task {task_state.key}")
                            cancel_keys = self.extensions["variables"].variables[
                                "cancel_keys"
                            ]["value"]
                            cancel_keys[task_state.key] = "cancel"
                            self.extensions["variables"].variables["cancel_keys"][
                                "value"
                            ] = cancel_keys

                        # get connection to Worker's Nanny
                        comm = await self.rpc.connect(worker_state.nanny)

                        # fetch data from worker before killing it
                        await self.replicate(
                            keys=[task.key for task in worker_state.has_what], n=2
                        )

                        # kill worker
                        print(f"killing worker {worker_address}")
                        await comm.write(
                            {"op": "restart"}
                        )  # causes all data stored on that worker to vanish

                else:
                    self.timeout_log[task_state.key] = dict()
                    self.timeout_log[task_state.key]["start_time"] = datetime.datetime.now()
                    self.timeout_log[task_state.key]["n_timeouts"] = 0


def sleep(seconds: int):
    from distributed.worker import thread_state

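    # thread_state.key is the key of the task currently executing in this worker thread.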
    key = thread_state.key
    print(f"I'm task '{key}' and I'm supposed to sleep for {seconds} seconds")

    cancel_keys = Variable("cancel_keys", client=get_client()).get()
    if key in cancel_keys and cancel_keys[key] == "cancel":
        print(f"I'm task '{key}' and I'm skipping sleeping")
        return DEFAULT_RETURN_VALUE

    time.sleep(seconds)
    seconds += 5
    return seconds


def print_results(arg):
    print(arg)
    return arg


def run():
    scheduler = {"cls": TimeoutScheduler, "options": {"dashboard_address": ":8787"}}
    workers = {
        "nanny1": {"cls": Nanny, "options": {"nthreads": 1}},
        "nanny2": {"cls": Nanny, "options": {"nthreads": 1}},
    }
    cluster = SpecCluster(scheduler=scheduler, workers=workers)

    client = Client(cluster)

    cancel_keys = Variable("cancel_keys")
    cancel_keys.set(dict())

    graph = {
        "sleep 10": (sleep, 10),
        "sleep 15": (sleep, "sleep 10"),
        "end": (print_results, ["sleep 15"]),
    }

    client.get(graph, "end")

    time.sleep(2)
    client.close()
    cluster.close()


if __name__ == "__main__":
    run()

Output:

I'm task 'sleep 10' and I'm supposed to sleep for 10 seconds
I'm task 'sleep 15' and I'm supposed to sleep for 15 seconds
killing worker tcp://192.168.0.10:60515
I'm task 'sleep 15' and I'm supposed to sleep for 15 seconds
cancelling task sleep 15
killing worker tcp://192.168.0.10:60518
I'm task 'sleep 15' and I'm supposed to sleep for 15 seconds
I'm task 'sleep 15' and I'm skipping sleeping
[-1]