What are some KilledWorker scenarios, and how do I recover from them?
This is in regard to https://github.com/dask/dask-jobqueue/issues/122 and possibly https://github.com/dask/distributed/issues/2297
It seems there are two ways that workers can return KilledWorker (please let me know if there are any more):
- If they're running out of memory, the nanny will kill them and return KilledWorker.
- If the worker is killed by some outside process (think HPC scheduler, i.e. SLURM).
  - Some HPCs have backfill queues, where workers can be killed at any second, but this (generally) gives you access to significantly more computing power.
In the first case, I probably want everything to die. In the second case, I want to re-create my workers and retry.
What are some things I should keep in mind when tackling this? Also, recreating this seems to be quite difficult, and ideas/suggestions would be appreciated. Randomly sending SIGKILL, SIGTERM, etc. doesn’t seem to recreate whatever is happening on my HPC.
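One guess, and it is only a guess, is that the cluster scheduler delivers a sequence of signals rather than a single one: many batch systems send SIGTERM first and follow up with SIGKILL after a grace period. The sketch below simulates that sequence against a stand-in child process on POSIX. `CHILD_CODE`, `spawn_child`, and `evict` are hypothetical names made up for this illustration, and the child is a plain Python process, not a real Dask worker.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a worker: it ignores SIGTERM, so only SIGKILL stops it.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_child():
    """Start the stand-in process and wait until its signal handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # blocks until the child prints 'ready'
    return proc


def evict(proc, grace_seconds=1.0):
    """Send SIGTERM, allow a grace period, then send SIGKILL if still running.

    Returns the exit code. On POSIX a negative value -N means
    "terminated by signal N".
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


proc = spawn_child()
exit_code = evict(proc, grace_seconds=0.5)
proc.stdout.close()
print(exit_code)
```

To try the same thing against real workers, `client.run(os.getpid)` returns the PID of each worker process. Note that a nanny will normally restart a worker that dies, which may differ from what happens when the whole batch job is evicted.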
An example of a dirty solution for scenario 2 (I am currently using this, together with similar changes to as_completed, and they appear to be working for the moment):
@@ -1463,10 +1465,23 @@ class Client(Node):
             @gen.coroutine
             def wait(k):
                 """ Want to stop the All(...) early if we find an error """
-                st = self.futures[k]
-                yield st.wait()
-                if st.status != 'finished' and errors == 'raise' :
-                    raise AllExit()
+                while True:
+                    st = self.futures[k]
+                    yield st.wait()
+                    if st.status != 'finished' and errors == 'raise' :
+                        try:
+                            possible_exception = st.exception
+                        except Exception:
+                            possible_exception = None
+                        if isinstance(possible_exception, KilledWorker):
+                            self.retry([Future(k, self, inform=False)])
+                            continue
+                        raise AllExit()
+                    break
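Note that the patch above retries unconditionally, so a task that keeps running out of memory (scenario 1) would be resubmitted indefinitely. One way to cover both scenarios is to cap the number of retries: a preempted task usually succeeds on a later attempt, while a task that always exhausts memory eventually fails for good. The following is a minimal sketch of that idea in plain Python. The `KilledWorker` class here is a local stand-in so the example runs without a cluster, and `run_with_retries` and `make_flaky_task` are hypothetical helpers, not part of distributed.

```python
class KilledWorker(Exception):
    """Local stand-in for distributed's KilledWorker (not the real class)."""


def run_with_retries(task, max_retries=3):
    """Call task(); retry on KilledWorker up to max_retries times, then re-raise."""
    retries = 0
    while True:
        try:
            return task()
        except KilledWorker:
            if retries >= max_retries:
                raise  # retries exhausted: probably scenario 1, so let it fail
            retries += 1


def make_flaky_task(failures_before_success):
    """Hypothetical task that is 'preempted' a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise KilledWorker("worker was killed")
        return "done"

    return task, state


task, state = make_flaky_task(failures_before_success=2)
print(run_with_retries(task, max_retries=3), state["calls"])
```

With a real cluster, the same cap could be applied around `Client.retry`, which the patch already calls, by counting how often a given key has been resubmitted.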
Top GitHub Comments
Thanks. For those passing by, the code above does seem to set the variable, based on doing a get. Going through the API looks like:
Take a look at https://docs.dask.org/en/latest/configuration.html and look for the section on the Python API.
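A minimal sketch of the Python configuration API that page documents, using `distributed.scheduler.allowed-failures` as an example key. Which key the comments above refer to is an assumption here; this one is chosen only because it controls how many worker deaths a task may be involved in before it is marked as failed.

```python
import dask

# Example key only (an assumption, not taken from the comments above).
KEY = "distributed.scheduler.allowed-failures"

# Set the value for the current process.
dask.config.set({KEY: 10})

# Read it back to confirm that the value was set.
print(dask.config.get(KEY))

# dask.config.set also works as a context manager for a temporary override.
with dask.config.set({KEY: 20}):
    print(dask.config.get(KEY))
```

The setting presumably has to be in place in the process that creates the scheduler, and before the scheduler starts, for it to have any effect on retries.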