
Retry policy when a worker crashes: a hook missing?

See original GitHub issue

So currently, if a worker crashes, the Ray task running on that worker is automatically retried. A few things about that:

  • I have not seen any way to control how many times a task is retried. If my Python code calls into some C code which segfaults, I do not want it to be retried over and over.
  • How do I know as the caller that this has happened? I have not seen anything about this exposed through ray.get(...). I just see messages printed to my stdout as the caller, but nothing I can detect programmatically (see the sketch after this list).
  • This assumes that functions are pure, or at least idempotent. But what if they are not? Then calling a function again might break things and leave them in an undefined state (or make the retry run against undefined state left over from the previous run).
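
For context, the controls asked about in this list were added to Ray after the issue was filed. A minimal sketch, assuming a recent Ray release where tasks accept a max_retries option and a crashed worker surfaces to the caller as ray.exceptions.WorkerCrashedError once retries are disabled or exhausted:

    import os
    import ray
    from ray.exceptions import WorkerCrashedError

    ray.init()

    # max_retries=0 turns off automatic re-execution after a worker crash;
    # the default is to retry the task a few times.
    @ray.remote(max_retries=0)
    def crashy():
        # Simulate a segfaulting C extension by killing the worker process.
        os._exit(1)

    try:
        ray.get(crashy.remote())
    except WorkerCrashedError:
        # The crash is now visible to the caller instead of only being
        # printed to stdout, so it can be handled programmatically.
        print("worker terminated while executing the task; not retrying")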

So maybe what is missing is a hook which is called once a worker dies. By default it would mark the task for retrying, but it could also do something else, like clean up broken outside state used by the task, or decide not to rerun the task and instead propagate an error to the caller. I would not even mind if ray.get(...) simply aborted with an exception saying “worker terminated”, do whatever you want. But ideally, I would prefer a hook I can register to handle even tasks I am not ray.get-ting.
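
For illustration only, the proposed hook might look roughly like the sketch below. The hook name, registration call, and cleanup helper are all hypothetical; Ray does not provide such an API.

    # Hypothetical sketch of the hook proposed above; none of these names
    # exist in Ray's API.
    def cleanup_external_state(task_description):
        # Placeholder for application-specific cleanup of outside state
        # (files, locks, partial writes) left behind by the crashed task.
        print(f"cleaning up after {task_description}")

    def on_worker_crash(task_description):
        cleanup_external_state(task_description)
        # Returning False would mean "do not retry, propagate an error to the
        # caller"; returning True would keep the default retry behaviour.
        return False

    # Hypothetical registration call, as suggested in this issue (commented
    # out because no such function exists in Ray):
    # ray.register_worker_crash_hook(on_worker_crash)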

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Aug 13, 2018

Okay, great! It may be several weeks before something like this would work end-to-end, but it is definitely on our list. The first step is to propagate the backend error to Python. The second step is to provide an experimental ray.retry method.
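
A minimal sketch of what the first step (propagating the backend error to Python) looks like from the caller's side, assuming the failure is surfaced as a subclass of ray.exceptions.RayError as in later Ray releases; the experimental ray.retry method mentioned here is shown only as a comment, since it is not part of the released API:

    import ray
    from ray.exceptions import RayError

    ray.init()

    @ray.remote
    def flaky_task():
        return 42

    obj_ref = flaky_task.remote()
    try:
        result = ray.get(obj_ref)
    except RayError as err:
        # Step one: the backend failure reaches Python as an exception the
        # caller can inspect, instead of only appearing in the logs.
        print(f"task failed in the backend: {err}")
        # Step two (an experimental ray.retry) was never released in this
        # form; resubmission would be done by the caller, e.g.:
        # obj_ref = flaky_task.remote()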

0 reactions
mitar commented, Mar 27, 2021

Unstale.

Read more comments on GitHub >

Top Results From Across the Web

What happens if n8n worker crashes? - Questions
If n8n crashes then by default will the data of all currently active workflows be lost (the only thing not lost would be...
Read more >
How do retries/crashes affect a long running polling activity ...
If an activity worker fails then the activity is going to timeout (probably due to missing heartbeat) and will be retried. Yes, a...
Read more >
Troubleshoot Azure Automation runbook issues
This article tells how to troubleshoot and resolve issues with Azure Automation runbooks.
Read more >
Configuration — Luigi 2.8.13 documentation - Read the Docs
If set to true, Luigi will NOT install this shutdown hook on workers. ... to the given retry-policy, be sure you run luigi...
Read more >
Fault Tolerance — Ray 2.2.0 - the Ray documentation
Retries #. When a worker is executing a task, if the worker dies unexpectedly, either because the process crashed or because the machine...
Read more >
