Retry policy when a worker crashes: a hook missing?
See original GitHub issue.
So currently, if a worker crashes, any Ray task running on that worker is automatically retried. Some observations about that:
- I have not seen any way to control how many times the task is retried. If my Python code calls into some C code which segfaults, I do not want Ray retrying it over and over.
- How do I know, as a caller, that this has happened? I have not seen anything about it exposed through ray.get(...). I only see messages printed to my stdout, but nothing I can detect programmatically.
- This assumes that functions are pure, or at least idempotent. But what if they are not? Then calling a function again might break things and get them into an undefined state (or into using undefined state left over from the previous run).
So, maybe what is missing is a hook which is called once a worker dies. By default it would mark the task for retrying, but it could also do something else, like clean up broken outside state used by the task, or decide not to rerun and instead propagate an error to the caller. I would not even mind if ray.get(...) simply aborted with an exception saying "worker terminated", leaving it to me to do whatever I want. But ideally, I would prefer a hook I can register to handle even tasks I am not ray.get-ting.
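For readers arriving later: recent Ray releases cover at least the first two points above. Below is a minimal sketch, assuming a current Ray version where tasks accept a max_retries option and a crashed worker surfaces to the caller as an exception from ray.get; the exact exception classes may vary between releases.

```python
import ray
from ray.exceptions import RayTaskError, WorkerCrashedError

ray.init()

# max_retries=0 disables automatic re-execution, so a task that
# segfaults inside a C extension is not resubmitted over and over.
@ray.remote(max_retries=0)
def fragile_task(x):
    # ... imagine this calls into C code that may segfault ...
    return x * 2

ref = fragile_task.remote(21)
try:
    result = ray.get(ref)
except WorkerCrashedError:
    # The worker process died (e.g. a segfault); with retries disabled the
    # failure reaches the caller here instead of only being printed to stdout.
    result = None
except RayTaskError:
    # An ordinary Python exception raised inside the task body.
    result = None
```

With max_retries=0, a segfaulting task is not resubmitted, and the failure becomes something the caller can catch rather than just a message on stdout.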
Issue Analytics
- Created: 5 years ago
- Comments: 9 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Okay, great! It may be several weeks before something like this would work end-to-end, but it is definitely on our list. The first step is to propagate the backend error to Python. The second step is to provide an experimental ray.retry method.

Unstale.
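The experimental ray.retry method mentioned here was a proposal at the time rather than a shipped API. A rough user-level approximation of the requested hook is sketched below, under the assumption that automatic retries are disabled on the task (e.g. via max_retries=0) and that worker deaths surface as ray.exceptions.WorkerCrashedError; the helper and hook names are illustrative, not part of Ray.

```python
import ray
from ray.exceptions import WorkerCrashedError

def get_with_worker_death_hook(remote_fn, *args, attempts=3, on_worker_death=None):
    """Run a Ray task and invoke a user hook each time its worker dies.

    Plain user code, not a Ray API: the hook may clean up external state and
    return True to resubmit the task or False to give up and re-raise.
    """
    for attempt in range(attempts):
        try:
            return ray.get(remote_fn.remote(*args))
        except WorkerCrashedError:
            should_retry = on_worker_death(attempt) if on_worker_death else False
            if not should_retry:
                raise
    raise RuntimeError(f"worker kept crashing after {attempts} attempts")
```

Note that this only covers tasks the caller actually ray.gets; the fire-and-forget case raised in the issue still has no obvious place to hang such a hook.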