Ray actors marked as dead while large files are being written

What is the problem?

ray 0.8.2

My setup: 2 nodes on my driver machine and many other nodes on remote machines. One of the nodes on my driver periodically writes large files to disk. Whenever the disk-writer writes a file, I get the following error:

2020-04-29 20:08:58,747 WARNING worker.py:1058 -- The node with client ID 116cd6d7effde94a120374931760615a6acec931 has been marked dead because the monitor has missed too many heartbeats from it.
2020-04-29 20:08:58,767 WARNING worker.py:1058 -- A worker died or was killed while executing task ffffffffffffffffee9d776a0100.

Shortly after, my job crashes.

Writing to disk was taking more than 30 seconds. I was able to work around this bug by raising the num_heartbeats_timeout parameter in _internal_config, so that the time needed to mark a node as failed exceeds 30 seconds. But it seems a little strange to be modifying something marked “For testing purposes only”.
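
For reference, a minimal sketch of that workaround. It assumes the Ray 0.8.x defaults of one heartbeat every 100 ms and that ray.init accepts _internal_config as a JSON string; neither is stated in this issue, so check both against your Ray version. The helper names and the safety factor are hypothetical, chosen only for illustration.

```python
import json
import math

# Assumption (not from the issue): Ray 0.8.x sends a heartbeat roughly every
# 100 ms, so the number of tolerated missed heartbeats maps to seconds as
# num_heartbeats_timeout * 0.1.
ASSUMED_HEARTBEAT_INTERVAL_MS = 100


def heartbeats_for_stall(max_stall_s,
                         interval_ms=ASSUMED_HEARTBEAT_INTERVAL_MS,
                         safety_factor=2.0):
    """Return how many missed heartbeats to tolerate for a stall of max_stall_s seconds."""
    return math.ceil(max_stall_s * safety_factor * 1000 / interval_ms)


def build_internal_config(max_stall_s):
    """Build a JSON string that overrides num_heartbeats_timeout."""
    return json.dumps({"num_heartbeats_timeout": heartbeats_for_stall(max_stall_s)})


# Example: disk writes that may stall a node for up to 45 seconds.
config = build_internal_config(max_stall_s=45)
print(config)

# Hypothetical usage, assuming ray.init takes a JSON string here in 0.8.x:
# import ray
# ray.init(_internal_config=config)
```

This only widens the failure-detection window; it does not explain why heartbeats are missed during the write.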

I have a few questions:

  1. Why are workers marked dead when I’m writing to disk?
  2. I eventually write the files to S3; if I upload the objects directly (they’re numpy arrays), will that avoid this issue altogether?
  3. As long as disk writes take less time than the heartbeat timeout, am I guaranteed to avoid this problem?
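
To see how close writes come to the timeout (question 3), one option is to time each write and flag any that use a large share of the window. This is a sketch only: the 30-second budget is taken from the description above, while the function names, chunk size, and 50% threshold are hypothetical. It measures, but guarantees nothing about whether heartbeats are missed.

```python
import os
import tempfile
import time

# Assumed budget: the roughly 30 s failure-detection window described above.
HEARTBEAT_BUDGET_S = 30.0


def timed_write(path, payload, chunk_size=8 * 1024 * 1024):
    """Write payload in chunks, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        for offset in range(0, len(payload), chunk_size):
            f.write(payload[offset:offset + chunk_size])
        f.flush()
        os.fsync(f.fileno())  # include the time to reach the disk
    return time.monotonic() - start


def exceeds_budget(elapsed_s, budget_s=HEARTBEAT_BUDGET_S, fraction=0.5):
    """Return True when a write used more than `fraction` of the budget."""
    return elapsed_s > budget_s * fraction


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.bin")
    elapsed = timed_write(target, b"\0" * (1024 * 1024))
    written = os.path.getsize(target)
    if exceeds_budget(elapsed):
        print(f"write took {elapsed:.1f}s, close to the heartbeat window")
```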

Reproduction (REQUIRED)

Will update in a couple days with a script.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
rkooo567 commented, Apr 30, 2020

Is there any way you can see if the same issue happens in 0.8.4?

0 reactions
ericl commented, Nov 19, 2020
