
What are some KilledWorker scenarios? + how to recover?


This is in regard to https://github.com/dask/dask-jobqueue/issues/122 and possibly https://github.com/dask/distributed/issues/2297

It seems there are two ways that workers can end up returning KilledWorker (please let me know if there are any more):

  • If they run out of memory, the nanny will kill them and the affected tasks come back as KilledWorker.
  • If the worker is killed by some outside process (think HPC scheduler, e.g. SLURM).
    • Some HPCs have backfill queues where workers can be killed at any moment, but these generally give you access to significantly more computing power.

In the first case, I probably want everything to die. In the second case, I want to re-create my workers and retry.
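
As background for the first case: the nanny's out-of-memory kill is itself configurable, so whether (and when) scenario 1 triggers is partly a settings question. A minimal sketch, using the documented distributed.worker.memory.terminate setting (0.95 of the memory limit is its usual default):

import dask

# Fraction of the memory limit at which the nanny terminates the worker.
# Lowering it makes the out-of-memory kill fire earlier; setting it to False
# disables the nanny's kill entirely (the OS can still OOM-kill the process).
dask.config.set({'distributed.worker.memory.terminate': 0.95})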

What are some things I should keep in mind when tackling this? Reproducing the problem also seems to be quite difficult, and ideas/suggestions would be appreciated: randomly sending SIGKILL, SIGTERM, etc. doesn't seem to recreate whatever is happening on my HPC.
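
One way to approximate this locally, as a sketch rather than a faithful reproduction of what an HPC scheduler does, is to make a worker process exit abruptly while tasks are in flight; the LocalCluster setup and the os._exit trick below are illustrative assumptions, since on a real cluster the batch system is the thing doing the killing:

import os
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Keep the workers busy so the kill lands while tasks are running.
futures = client.map(lambda x: x ** 2, range(1000))

# Pick one worker and make its process exit immediately, with no cleanup.
victim = list(client.scheduler_info()['workers'])[0]
client.run(os._exit, 1, workers=[victim], wait=False)

# If the same task dies on enough workers (distributed.scheduler.allowed-failures,
# 3 by default), gathering it raises KilledWorker; a single kill is usually just
# rescheduled after the nanny restarts the worker.
results = client.gather(futures)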

An example of a dirty solution for scenario 2 (I am currently using this, plus similar changes to as_completed, which appear to be working for the moment):

@@ -1463,10 +1465,23 @@ class Client(Node):
         @gen.coroutine
         def wait(k):
             """ Want to stop the All(...) early if we find an error """
-            st = self.futures[k]
-            yield st.wait()
-            if st.status != 'finished' and errors == 'raise' :
-                raise AllExit()
+            while True:
+                st = self.futures[k]
+                yield st.wait()
+                if st.status != 'finished' and errors == 'raise' :
+                    try:
+                        possible_exception = st.exception
+                    except Exception:
+                        possible_exception = None
+                    if type(possible_exception) == KilledWorker:
+                        self.retry([Future(k, self, inform=False)])
+                        continue
+                    raise AllExit()
+                break
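
For anyone who would rather not patch distributed itself, roughly the same effect can be had from the outside: catch KilledWorker when gathering, resubmit the errored futures with Client.retry, and gather again. The sketch below is only an illustration of that idea (the function name and the max_restarts cap are made up here), not something the library provides:

from distributed import KilledWorker

def gather_with_retry(client, futures, max_restarts=3):
    # Retry futures whose worker was killed (scenario 2) instead of letting
    # gather() raise immediately; give up after max_restarts rounds.
    for _ in range(max_restarts):
        try:
            return client.gather(futures)
        except KilledWorker:
            failed = [f for f in futures if f.status == 'error']
            client.retry(failed)
    # Final attempt: let any remaining KilledWorker propagate to the caller.
    return client.gather(futures)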

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
chrisroat commented, Feb 21, 2020

Thanks. For those passing by, the code above does seem to set the variable, based on checking it with a dask.config.get. Going through the API looks like:

import dask
import distributed
dask.config.set({'distributed.scheduler.allowed-failures': 50})

0 reactions
mrocklin commented, Feb 21, 2020

Take a look at https://docs.dask.org/en/latest/configuration.html and look for the section on Python API
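
For completeness, and hedging a little since the exact spelling can vary between dask versions, the same knob can also be set outside of Python code as described on that configuration page; it needs to be in place before the scheduler is created, since the value is read at scheduler startup:

import dask

# 1. Temporarily, as a context manager wrapped around cluster/client creation:
with dask.config.set({'distributed.scheduler.allowed-failures': 50}):
    ...  # start the scheduler / LocalCluster / Client inside this block

# 2. Process-wide, via an environment variable set before launching the scheduler:
#      export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=50
# 3. Or in a YAML file in the dask config directory (e.g. ~/.config/dask/),
#    nested as distributed -> scheduler -> allowed-failures: 50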


