[Bug] Alive node reported by ray.nodes() failed to execute remote task

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

One worker node was marked as dead (due to memory swapping, which is another story):

The node with node id: 63049713526c2df2956d5f9296055ed9988cacd237c4c5d2aafe1a5a and ip: 192.168.11.21 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats

raylet.out of this dead node:

[2022-02-16 12:46:52,789 C 149 149] node_manager.cc:170: This node has beem marked as dead.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    std::_Function_handler<>::_M_invoke()
    std::_Function_handler<>::_M_invoke()
    std::_Function_handler<>::_M_invoke()
    ray::rpc::ClientCallImpl<>::OnReplyReceived()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

However, maybe 1 hour later (I am not sure how much time elapsed after the node was marked as dead), ray.nodes() showed that the node was alive:

{'NodeID': 'fd09404717968d7458660c7e24a85106de826d0acfc45fdfcf91a9d3', 'Alive': True, 'NodeManagerAddress': '192.168.11.21', 'NodeManagerHostname': 'GPU-1121', 'NodeManagerPort': 40964, 'ObjectManagerPort': 35237, 'ObjectStoreSocketName': '/tmp/ray/session_2022-02-11_21-11-04_446422_248/sockets/plasma_store.1', 'RayletSocketName': '/tmp/ray/session_2022-02-11_21-11-04_446422_248/sockets/raylet.1', 'MetricsExportPort': 54534, 'alive': True, 'Resources': {'node:192.168.11.21': 1.0, 'GPU': 1.0, 'object_store_memory': 4294967296.0, 'accelerator_type:G': 1.0, 'memory': 27463581696.0, 'CPU': 12.0}}
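Note that the NodeID in this entry differs from the NodeID in the heartbeat message above, although the IP address is the same. A minimal sketch for spotting that situation is shown here, assuming the input is a list of dicts shaped like the ray.nodes() entry above. The helper names and the sample data are hypothetical and are not part of Ray.

```python
from collections import defaultdict


def alive_nodes(nodes):
    """Return only the entries currently reported as alive."""
    return [n for n in nodes if n.get("Alive")]


def node_ids_by_address(nodes):
    """Group (NodeID, alive flag) pairs by NodeManagerAddress."""
    grouped = defaultdict(list)
    for n in nodes:
        grouped[n["NodeManagerAddress"]].append((n["NodeID"], bool(n.get("Alive"))))
    return dict(grouped)


# Hypothetical sample data: two registrations for the same IP address,
# one dead and one alive, plus an unrelated node.
sample = [
    {"NodeID": "node-a", "Alive": False, "NodeManagerAddress": "192.168.11.21"},
    {"NodeID": "node-b", "Alive": True, "NodeManagerAddress": "192.168.11.21"},
    {"NodeID": "node-c", "Alive": True, "NodeManagerAddress": "192.168.11.22"},
]

for address, entries in node_ids_by_address(sample).items():
    if len(entries) > 1:
        # More than one NodeID has been registered from this address.
        print(address, entries)
```

On a live cluster the same helpers could be fed with the result of ray.nodes() after connecting with ray.init(address="auto"). An "Alive: True" entry only reflects what the cluster reports, so it does not by itself show that the node can run tasks.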

But Ray dispatched all new remote tasks to this node, and all of these tasks failed with:

ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

So I cannot start any new tasks because one node was marked as dead!

This is one possible way to crash the dashboard; see #18889. I am creating a new issue because I think this is a severe problem that is better tracked separately than inside a bug report about how the dashboard crashed.

Versions / Dependencies

ray 1.10.0

Reproduction script

Overload one worker node so that it stops responding to the head node, as described in the “What happened + What you expected to happen” section.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
newmanwang commented, Feb 17, 2022

The particular issue you are seeing is probably because the dashboard agent was not started properly. It is a known issue that we will solve soon. Can you check whether, on the dead node, ps aux | grep agent.py shows the agent alive, and check the log /tmp/ray/session_[]/logs/dashboard_agent.log? The runtime env is created by the agent, and if the agent is not started properly, tasks can fail to be scheduled because the runtime env cannot be created.

Oh yes, the dashboard agent died a few seconds after the raylet was marked as dead.

2022-02-16 12:47:01,328 ERROR agent.py:126 -- Raylet is dead, exiting.

Thanks for the explanation of how the runtime env works. Yes, the process started by agent.py was dead at that time, if I remember correctly.
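The two checks discussed in this comment (is agent.py still running, and did the agent log that the raylet died) can be sketched in Python. This is only an illustration: the function names are hypothetical, the log file path has to be supplied by the caller, and the string match assumes the log line looks like the one quoted above.

```python
import subprocess


def agent_process_lines(ps_output):
    """Return the lines of `ps aux` output that mention agent.py."""
    return [line for line in ps_output.splitlines() if "agent.py" in line]


def raylet_dead_lines(log_lines):
    """Return log lines in which the agent reports that the raylet is dead."""
    return [line.rstrip("\n") for line in log_lines if "Raylet is dead" in line]


def read_ps_output():
    """Run `ps aux` on the local machine and return its text output."""
    return subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout


def check_agent(log_path):
    """Print both checks for the node this runs on; log_path is caller-supplied."""
    running = agent_process_lines(read_ps_output())
    print("agent.py processes found:", len(running))
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in raylet_dead_lines(f):
            print(line)
```

check_agent would be called on the affected worker node with the path to that node's dashboard_agent.log. No agent.py process together with a "Raylet is dead, exiting." line would match what is described in this comment.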

0 reactions
zhe-thoughts commented, Nov 2, 2022

Closing this, but @newmanwang please reopen when you have a chance to address @rkooo567's question. Thanks.
