Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[core] Async Actor Task fails when `max_retries=-1`

See original GitHub issue

What is the problem?

An actor task fails when the actor dies, even though the actor was created with `max_retries=-1` and `max_restarts=-1`.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

The easiest way to test this is with Serve:

  1. Add the following method to python/ray/serve/controller.py::ServeController:
    def _test_crash(self):
        os._exit(0)
  2. Add the following between L44 & L45 (the assert) in python/ray/serve/tests/test_standalone.py:
    with pytest.raises(ray.exceptions.RayActorError):
        ray.get(client._controller._test_crash.remote())
  3. Run python -m pytest -sv python/ray/serve/tests/test_standalone.py::test_detached_deployment
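
For reference, here is a standalone sketch of the retry semantics the repro exercises. It is not the Serve-based repro above: it assumes Ray's documented actor options max_restarts and max_task_retries (the report refers to them as max_restart and max_retries), and it substitutes ray.kill(..., no_restart=False) for the os._exit() crash so that the killed call is not itself retried indefinitely.

    import asyncio

    import ray

    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class AsyncWorker:
        async def ping(self):
            await asyncio.sleep(0.1)
            return "pong"

    if __name__ == "__main__":
        ray.init()
        worker = AsyncWorker.remote()
        assert ray.get(worker.ping.remote()) == "pong"

        # Kill the actor process; no_restart=False lets Ray restart it,
        # standing in for the os._exit(0) crash used in the Serve repro.
        ray.kill(worker, no_restart=False)

        # With max_task_retries=-1 this call should be retried transparently
        # once the actor has restarted; the report is that async actor tasks
        # fail here instead of being retried.
        print(ray.get(worker.ping.remote()))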

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
simon-mo commented, Nov 13, 2020

I will work on this issue, as it blocks Serve fault tolerance.

0 reactions
stephanie-wang commented, Nov 20, 2020

I see. What about just resending the tasks that were already completed? That way, we don’t need to modify the receiver logic at all; we can just save the out-of-order task specs on the sender side. I can actually see an argument for this approach, since it follows the same execution semantics provided during normal execution: the execution order follows the submission order.

I’m fine with modifying the receiver logic if it’s necessary, but I’d prefer not to, since it’s nice to keep it free of any recovery logic.
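
To make the sender-side option concrete, here is an illustrative, Ray-independent sketch of the idea (the names TaskSpec and ActorTaskSubmitter are hypothetical, not Ray internals): the caller keeps the specs of tasks that already completed and, when the actor restarts, resends everything in submission order, so the receiver needs no recovery-specific logic.

    from dataclasses import dataclass, field
    from typing import Callable, Dict

    @dataclass(frozen=True)
    class TaskSpec:
        seq_no: int               # position in submission order
        send: Callable[[], None]  # re-submits this task to the actor

    @dataclass
    class ActorTaskSubmitter:
        completed: Dict[int, TaskSpec] = field(default_factory=dict)
        in_flight: Dict[int, TaskSpec] = field(default_factory=dict)

        def submit(self, spec: TaskSpec) -> None:
            self.in_flight[spec.seq_no] = spec
            spec.send()

        def on_task_finished(self, seq_no: int) -> None:
            # Keep the spec of a finished task instead of dropping it, so it
            # can be replayed if the actor later dies and is restarted.
            self.completed[seq_no] = self.in_flight.pop(seq_no)

        def on_actor_restart(self) -> None:
            # Replay every known task in its original submission order:
            # tasks that had already completed as well as those still in
            # flight, so the restarted actor sees the same order as before.
            replay = list(self.completed.values()) + list(self.in_flight.values())
            for spec in sorted(replay, key=lambda s: s.seq_no):
                spec.send()

The trade-off, as the comment suggests, is extra sender-side state (the retained specs of completed tasks) in exchange for leaving the receiver’s execution path unchanged.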

Read more comments on GitHub >

Top Results From Across the Web

Task on MainActor does not run on the main thread, why?
I would expect that static func main() async throws inherits MainActor async context and will prevent data races, so the final counter value...
Read more >
Problem with handling async inside of an actix-rust actor
Currently I'm checking out Actix, a Rust based actor framework. ... I'm not being able to start a simple async task inside of...
Read more >
Using @MainActor to ensure execution on the main thread
Hi, I'm right now trying to build up a mental model of the new concurrency mechanisms, namely async await, actors and the MainActor....
Read more >
AsyncIO / Concurrency for Actors — Ray 2.2.0
Setting concurrency in Async Actors#. You can set the number of “concurrent” tasks running at once using the max_concurrency flag. By default, 1000...
Read more >
The Actor Reentrancy Problem in Swift - Swift Senpai
    private func authorizeTransaction() async -> Bool {
        // Wait for 1 second
        try? await Task.sleep(nanoseconds: 1 * 1000000000)
        return true
    }
Read more >
