Canceling long running tasks
Let’s assume you decide to write the following in an interactive shell session (or jupyter notebook):
>>> from threading import current_thread
>>> from time import sleep
>>> from dask import delayed, compute
>>> def print_sleep_repeat():
...     while True:
...         print('I am %s and I am still alive' % current_thread().ident)
...         sleep(1)
...
>>> stuff = [delayed(print_sleep_repeat)() for i in range(4)]
>>> compute(*stuff)
I am 140279596939008 and I am still alive
I am 140279613724416 and I am still alive
I am 140279415871232 and I am still alive
I am 140279605331712 and I am still alive
[...]
You soon realize that it was a bad idea to submit those annoying tasks to your shell. You try to Ctrl-C the call to dask.compute, but then you realize that this does not prevent the threads from continuing to run in the background:
^CTraceback (most recent call last):
File "<ipython-input-2-c5db33259be8>", line 10, in <module>
compute(*stuff)
File "/home/ogrisel/code/dask/dask/base.py", line 110, in compute
results = get(dsk, keys, **kwargs)
File "/home/ogrisel/code/dask/dask/threaded.py", line 57, in get
**kwargs)
File "/home/ogrisel/code/dask/dask/async.py", line 474, in get_async
key, res, tb, worker_id = queue.get()
File "/usr/lib/python3.5/queue.py", line 164, in get
self.not_empty.wait()
File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
KeyboardInterrupt
>>> I am 140279613724416 and I am still alive
I am 140279596939008 and I am still alive
I am 140279415871232 and I am still alive
I am 140279605331712 and I am still alive
I am 140279613724416 and I am still alive
I am 140279605331712 and I am still alive
I am 140279596939008 and I am still alive
I am 140279415871232 and I am still alive
I think interrupting the call to dask.compute should try its best to interrupt all the scheduled tasks. Possible solutions:
1- Terminate the whole ThreadPool (although I am not even sure that would solve this issue)
2- Leverage the ctypes.pythonapi.PyThreadState_SetAsyncExc trick: http://stackoverflow.com/questions/323972/is-there-any-way-to-kill-a-thread-in-python
3- Try to use signal.pthread_kill, which should make it possible to also kill long-running compiled extensions that never reach back into the Python interpreter to receive the PyThreadState_SetAsyncExc interruption.
The ctypes.pythonapi.PyThreadState_SetAsyncExc trick is nice because it should not run the risk of deadlocking the Python process by messing with the GIL or other Python run-time state.
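For reference, a minimal sketch of that trick, adapted from the recipe in the Stack Overflow thread linked above (the helper name async_raise is made up here, and the exact ctypes integer type for the thread id can differ between Python versions, so treat this as illustrative rather than definitive):

```python
import ctypes

def async_raise(thread_ident, exc_type=KeyboardInterrupt):
    """Ask CPython to raise exc_type asynchronously in the target thread.

    The exception is only delivered the next time that thread executes
    Python bytecode, so it cannot interrupt a call stuck inside a C extension.
    """
    res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_long(thread_ident), ctypes.py_object(exc_type))
    if res == 0:
        raise ValueError("invalid thread id: %r" % thread_ident)
    elif res > 1:
        # More than one thread state was affected: undo the request and bail out.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(
            ctypes.c_long(thread_ident), None)
        raise SystemError(
            "PyThreadState_SetAsyncExc affected %d threads" % res)
```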
Top GitHub Comments
@zhanghang1989 No, I implemented it without dask.
I don’t think the Coordination Primitives existed when I wrote my code.
In particular, Global Variables seem to be able to address some of these use cases.
Your algorithm could look like this:
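(The snippet that originally followed is not preserved in this archive. As a rough illustration of the idea, using dask.distributed's Variable as a shared, cooperative stop flag, where the name 'stop' and the long_running function are invented for the example, it could look something like this:)

```python
from time import sleep
from dask.distributed import Client, Variable

client = Client()

stop = Variable('stop')   # a named flag shared through the scheduler
stop.set(False)

def long_running(i, stop):
    # Check the shared flag between units of work and exit early when asked to.
    while not stop.get():
        sleep(1)          # replace with one chunk of real work
    return i

futures = client.map(long_running, range(4), stop=stop)

# Later, from the client (or from any other task), ask every task to stop:
stop.set(True)
```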
If you call other functions (from sklearn, like in tpot), I would recommend putting them in another process which you can kill when the stop variable is set. Do not forget to use the right parameters to allocate enough cores through your queue system.

I’m in a similar situation where tasks may randomly get stuck (either in C code that I don’t own or during communication with a database), and the only clean option right now is to completely terminate the entire Cluster and rerun everything. Obviously this is far from ideal.

I just hacked together this minimal example that seems to do the job for the “single-thread-per-worker” case. I can already see multiple gotchas and expect to encounter many more should I go forward and implement something like this in a real system. Use at your own risk, you’ve been warned.
The solution works by creating a custom Scheduler which periodically checks how long each task has been in the processing state and restarts the associated worker should the lifetime of the task exceed a predefined threshold. In my real-world use case I don’t need the return values of the tasks (they save their results to disk) and can deal with failed tasks in the pipeline. Hence I modify the task to return a default value if it has timed out too often.
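(The custom Scheduler code and its output were not preserved in this archive. As a much-simplified, client-side sketch of the same idea, polling how long each key has been processing and restarting on timeout, something like the following could serve as a starting point; note that it uses Client.restart(), which restarts every worker rather than only the offending one as described above, and the function name watchdog is invented for the example:)

```python
import time
from dask.distributed import Client

def watchdog(client: Client, timeout: float = 60.0, interval: float = 5.0):
    """Restart the workers if any task stays in 'processing' longer than timeout."""
    first_seen = {}  # key -> timestamp when we first saw it processing
    while True:
        processing = client.processing()  # {worker_address: (key, key, ...)}
        now = time.time()
        active = {key for keys in processing.values() for key in keys}
        # Forget keys that are no longer being processed.
        first_seen = {k: t for k, t in first_seen.items() if k in active}
        for key in active:
            first_seen.setdefault(key, now)
            if now - first_seen[key] > timeout:
                print("task %r exceeded %ss, restarting workers" % (key, timeout))
                client.restart()
                first_seen.clear()
                break
        time.sleep(interval)
```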