What are some KilledWorker scenarios, and how do I recover from them?

This relates to https://github.com/dask/dask-jobqueue/issues/122 and possibly https://github.com/dask/distributed/issues/2297

It seems there are two ways a worker can end up producing KilledWorker (please let me know if there are more):

1. The worker runs out of memory, so the nanny kills it and KilledWorker is returned.
2. The worker is killed by some outside process (think of an HPC scheduler such as SLURM).
   - Some HPC systems have backfill queues, where workers can be killed at any second, but which generally give you access to significantly more computing power.

In the first case, I probably want everything to die. In the second case, I want to recreate my workers and retry.

What should I keep in mind when tackling this? Reproducing the problem also seems to be quite difficult, so any ideas or suggestions would be appreciated. Randomly sending SIGKILL, SIGTERM, etc. doesn't seem to recreate whatever is happening on my HPC system.

An example of a dirty solution for scenario 2 (I am currently using this, along with similar changes to as_completed, and it appears to be working for the moment):

@@ -1463,10 +1465,23 @@ class Client(Node):
         @gen.coroutine
         def wait(k):
             """ Want to stop the All(...) early if we find an error """
-            st = self.futures[k]
-            yield st.wait()
-            if st.status != 'finished' and errors == 'raise' :
-                raise AllExit()
+            while True:
+                st = self.futures[k]
+                yield st.wait()
+                if st.status != 'finished' and errors == 'raise' :
+                    try:
+                        possible_exception = st.exception
+                    except Exception:
+                        possible_exception = None
+                    if isinstance(possible_exception, KilledWorker):
+                        self.retry([Future(k, self, inform=False)])
+                        continue
+                    raise AllExit()
+                break
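
The same retry idea can also be tried from user code, without patching `Client`. The following is a minimal sketch rather than anything from the issue itself: `result_with_retry` and its parameters are hypothetical names, and it only assumes that the future object exposes `.result()` and `.retry()`, as `distributed.Future` does. It cannot tell an out-of-memory kill from an external kill, so the bounded `max_retries` is what stops scenario 1 from looping forever.

```python
import time


def result_with_retry(future, retry_on, max_retries=3, delay=0.0):
    """Return future.result(), resubmitting when a retryable error is raised.

    `future` is any object with .result() and .retry() methods.
    `retry_on` is the exception type (or tuple of types) treated as
    recoverable, for example KilledWorker.
    """
    attempts = 0
    while True:
        try:
            return future.result()
        except retry_on:
            if attempts >= max_retries:
                raise  # give up: the failure is probably not transient
            attempts += 1
            time.sleep(delay)  # leave time for replacement workers to join
            future.retry()  # ask the scheduler to run the task again


# Hypothetical usage with Dask (assumes a running Client named `client`
# and a user function `my_function`):
#   from distributed.scheduler import KilledWorker
#   fut = client.submit(my_function, arg)
#   value = result_with_retry(fut, KilledWorker, max_retries=5, delay=30)
```

Passing the exception type in as an argument keeps the helper independent of Dask, so the retry policy can be exercised with a stand-in future before it is pointed at a real cluster.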

Top GitHub Comments

chrisroat commented, Feb 21, 2020

Thanks. For those passing by: the code above does seem to set the variable, judging by a subsequent get. Setting it through the Python API looks like this:

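import dask
import distributed  # importing distributed registers its default configuration

# Number of workers allowed to die while running a task before it is marked bad (KilledWorker)
dask.config.set({'distributed.scheduler.allowed-failures': 50})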
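
The "get" mentioned above can be done with `dask.config.get`. This is a minimal sketch, assuming only that `dask` is installed; it uses `dask.config.set` as a context manager so that the override is temporary. The value 50 is simply the one used in this comment, not a recommendation.

```python
import dask

KEY = "distributed.scheduler.allowed-failures"

# Used as a context manager, dask.config.set applies the override
# only inside the `with` block and restores the old value afterwards.
with dask.config.set({KEY: 50}):
    inside = dask.config.get(KEY)

print(inside)
```

As far as I understand, the scheduler reads this setting when it is created, so the override most likely needs to be active in the process that starts the scheduler, before the cluster object is constructed.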

mrocklin commented, Feb 21, 2020

Take a look at https://docs.dask.org/en/latest/configuration.html and look for the section on the Python API.
