Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[core] Async Actor Task fails when `max_retries=-1`

See original GitHub issue

What is the problem?

An actor task fails when the actor dies, even though the actor was created with `max_retries=-1` and `max_restarts=-1`.

Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

The easiest way to test this is with Serve:

  1. Add the following method to python/ray/serve/controller.py::ServeController:
    def _test_crash(self):
        os._exit(0)
  2. Add the following between L44 & L45 (the assert) in python/ray/serve/tests/test_standalone.py:
    with pytest.raises(ray.exceptions.RayActorError):
        ray.get(client._controller._test_crash.remote())
  3. Run python -m pytest -sv python/ray/serve/tests/test_standalone.py::test_detached_deployment
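
For reference, here is a standalone sketch of the retry semantics the repro exercises. It is not the Serve-based repro above: it assumes Ray's documented actor options max_restarts and max_task_retries (the report refers to them as max_restart and max_retries), and it substitutes ray.kill(..., no_restart=False) for the os._exit() crash so that the killed call is not itself retried indefinitely.

    import asyncio

    import ray

    @ray.remote(max_restarts=-1, max_task_retries=-1)
    class AsyncWorker:
        async def ping(self):
            await asyncio.sleep(0.1)
            return "pong"

    if __name__ == "__main__":
        ray.init()
        worker = AsyncWorker.remote()
        assert ray.get(worker.ping.remote()) == "pong"

        # Kill the actor process; no_restart=False lets Ray restart it,
        # standing in for the os._exit(0) crash used in the Serve repro.
        ray.kill(worker, no_restart=False)

        # With max_task_retries=-1 this call should be retried transparently
        # once the actor has restarted; the report is that async actor tasks
        # fail here instead of being retried.
        print(ray.get(worker.ping.remote()))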

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
simon-mo commented, Nov 13, 2020

I will work on this issue, as it blocks Serve fault tolerance.

0 reactions
stephanie-wang commented, Nov 20, 2020

I see. What about just resending the tasks that were already completed? That way, we don’t need to modify the receiver logic at all; we can just save the out-of-order task specs on the sender side. I can actually see an argument for this approach, since it follows the same execution semantics provided during normal execution: the execution order follows the submission order.

I’m fine with modifying the receiver logic if it’s necessary, but I’d prefer not to, since it’s nice to keep it free of any recovery logic.
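
To make the sender-side option concrete, here is an illustrative, Ray-independent sketch of the idea (the names TaskSpec and ActorTaskSubmitter are hypothetical, not Ray internals): the caller keeps the specs of tasks that already completed and, when the actor restarts, resends everything in submission order, so the receiver needs no recovery-specific logic.

    from dataclasses import dataclass, field
    from typing import Callable, Dict

    @dataclass(frozen=True)
    class TaskSpec:
        seq_no: int               # position in submission order
        send: Callable[[], None]  # re-submits this task to the actor

    @dataclass
    class ActorTaskSubmitter:
        completed: Dict[int, TaskSpec] = field(default_factory=dict)
        in_flight: Dict[int, TaskSpec] = field(default_factory=dict)

        def submit(self, spec: TaskSpec) -> None:
            self.in_flight[spec.seq_no] = spec
            spec.send()

        def on_task_finished(self, seq_no: int) -> None:
            # Keep the spec of a finished task instead of dropping it, so it
            # can be replayed if the actor later dies and is restarted.
            self.completed[seq_no] = self.in_flight.pop(seq_no)

        def on_actor_restart(self) -> None:
            # Replay every known task in its original submission order:
            # tasks that had already completed as well as those still in
            # flight, so the restarted actor sees the same order as before.
            replay = list(self.completed.values()) + list(self.in_flight.values())
            for spec in sorted(replay, key=lambda s: s.seq_no):
                spec.send()

The trade-off, as the comment suggests, is extra sender-side state (the retained specs of completed tasks) in exchange for leaving the receiver’s execution path unchanged.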

Read more comments on GitHub >

Top Results From Across the Web

Task on MainActor does not run on the main thread, why?
I would expect that static func main() async throws inherits MainActor async context and will prevent data races, so the final counter value...
Read more >
Problem with handling async inside of an actix-rust actor
Currently I'm checking out Actix, a Rust based actor framework. ... I'm not being able to start a simple async task inside of...
Read more >
Using @MainActor to ensure execution on the main thread
Hi, I'm right now trying to build up a mental model of the new concurrency mechanisms, namely async await, actors and the MainActor....
Read more >
AsyncIO / Concurrency for Actors — Ray 2.2.0
Setting concurrency in Async Actors#. You can set the number of “concurrent” tasks running at once using the max_concurrency flag. By default, 1000...
Read more >
The Actor Reentrancy Problem in Swift - Swift Senpai
    private func authorizeTransaction() async -> Bool {
        // Wait for 1 second
        try? await Task.sleep(nanoseconds: 1 * 1000000000)
        return true
    }
Read more >
