Ray actors marked as dead while large files are being written.
What is the problem?
ray 0.8.2
I have the following setup: I have 2 nodes on my driver and many other nodes on remote machines. One of the nodes on my driver is periodically writing large files to disk. When the disk-writer writes the file, I get the following error:
2020-04-29 20:08:58,747 WARNING worker.py:1058 -- The node with client ID 116cd6d7effde94a120374931760615a6acec931 has been marked dead because the monitor has missed too many heartbeats from it.
2020-04-29 20:08:58,767 WARNING worker.py:1058 -- A worker died or was killed while executing task ffffffffffffffffee9d776a0100.
Shortly after, my job crashes.
Writing to disk was taking > 30 seconds. I was able to work around this bug by adjusting the num_heartbeats_timeout parameter in _internal_config, so that the time needed to mark a node as failed is > 30 sec. But it seems a little strange to be modifying something marked “For testing purposes only”.
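For anyone trying the same workaround, here is a minimal sketch of how the value could be derived. It assumes Ray 0.8.x, where `ray.init` accepts `_internal_config` as a JSON string, and it assumes a heartbeat period of 100 ms (which would make the 30-second window correspond to 300 missed heartbeats); check both against your installed version. The helper name `heartbeat_config` is hypothetical and not part of Ray.

```python
import json
import math


def heartbeat_config(timeout_seconds, heartbeat_period_ms=100):
    """Build a JSON string suitable for ray.init(_internal_config=...).

    heartbeat_period_ms is an assumption (100 ms); verify it for your
    Ray version before relying on the resulting timeout.
    """
    if timeout_seconds <= 0 or heartbeat_period_ms <= 0:
        raise ValueError("timeout and heartbeat period must be positive")
    # Number of missed heartbeats that adds up to the desired timeout,
    # rounded up so the window is never shorter than requested.
    num_heartbeats = math.ceil(timeout_seconds * 1000 / heartbeat_period_ms)
    return json.dumps({"num_heartbeats_timeout": num_heartbeats})


# Allow 60 seconds of silence before a node is marked dead.
config = heartbeat_config(60)

# Passing it to Ray (left commented out so the snippet runs without a cluster):
# import ray
# ray.init(_internal_config=config)
```

The timeout should comfortably exceed the longest expected disk write. This only widens the window; it does not explain why heartbeats stop during the write.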
I have a few questions:
- Why are workers marked dead when I’m writing to disk?
- I eventually write the files to S3; if I upload the objects directly (they’re numpy arrays), will that avoid this issue altogether?
- As long as disk writes take less time than the heartbeat timeout, am I guaranteed to avoid this problem?
Reproduction (REQUIRED)
Will update in a couple days with a script.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:12 (8 by maintainers)
Top GitHub Comments
Is there any way you can see if the same issue happens in 0.8.4?
Duplicates https://github.com/ray-project/ray/issues/11624