Retry policy when a worker crashes: a hook missing?
See original GitHub issue.
So currently, if a worker crashes, any Ray task running on that worker is automatically retried. Some observations about that:
- I have not seen any way to control how many times the task is retried. If my Python code calls into some C code which segfaults, I do not want Ray retrying it over and over.
- How do I know, as a caller, that this has happened? I have not seen anything about it exposed through ray.get(...). I only see messages printed to my stdout, but nothing I can detect programmatically.
- This assumes that functions are pure, or at least idempotent. But what if they are not? Then calling a function again might break things and get them into an undefined state (or into using undefined state left over from the previous run).
So, maybe what is missing is a hook which is called once a worker dies. By default it would mark the task for retrying, but it could also do something else, like clean up broken outside state used by the task, or decide not to rerun and instead propagate an error to the caller. I would not even mind if ray.get(...) simply aborted with an exception saying "worker terminated", leaving it to me to do whatever I want. But ideally, I would prefer a hook I can register to handle even tasks I am not ray.get-ting.
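For readers arriving later: recent Ray releases cover at least the first two points above. Below is a minimal sketch, assuming a current Ray version where tasks accept a max_retries option and a crashed worker surfaces to the caller as an exception from ray.get; the exact exception classes may vary between releases.

```python
import ray
from ray.exceptions import RayTaskError, WorkerCrashedError

ray.init()

# max_retries=0 disables automatic re-execution, so a task that
# segfaults inside a C extension is not resubmitted over and over.
@ray.remote(max_retries=0)
def fragile_task(x):
    # ... imagine this calls into C code that may segfault ...
    return x * 2

ref = fragile_task.remote(21)
try:
    result = ray.get(ref)
except WorkerCrashedError:
    # The worker process died (e.g. a segfault); with retries disabled the
    # failure reaches the caller here instead of only being printed to stdout.
    result = None
except RayTaskError:
    # An ordinary Python exception raised inside the task body.
    result = None
```

With max_retries=0, a segfaulting task is not resubmitted, and the failure becomes something the caller can catch rather than just a message on stdout.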
Issue Analytics
- Created: 5 years ago
- Comments: 9 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Okay, great! It may be several weeks before something like this would work end-to-end, but it is definitely on our list. The first step is to propagate the backend error to Python. The second step is to provide an experimental ray.retry method.

Unstale.
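The experimental ray.retry method mentioned here was a proposal at the time rather than a shipped API. A rough user-level approximation of the requested hook is sketched below, under the assumption that automatic retries are disabled on the task (e.g. via max_retries=0) and that worker deaths surface as ray.exceptions.WorkerCrashedError; the helper and hook names are illustrative, not part of Ray.

```python
import ray
from ray.exceptions import WorkerCrashedError

def get_with_worker_death_hook(remote_fn, *args, attempts=3, on_worker_death=None):
    """Run a Ray task and invoke a user hook each time its worker dies.

    Plain user code, not a Ray API: the hook may clean up external state and
    return True to resubmit the task or False to give up and re-raise.
    """
    for attempt in range(attempts):
        try:
            return ray.get(remote_fn.remote(*args))
        except WorkerCrashedError:
            should_retry = on_worker_death(attempt) if on_worker_death else False
            if not should_retry:
                raise
    raise RuntimeError(f"worker kept crashing after {attempts} attempts")
```

Note that this only covers tasks the caller actually ray.gets; the fire-and-forget case raised in the issue still has no obvious place to hang such a hook.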