[Bug] Alive node reported by ray.nodes() failed to execute remote task
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
One worker node was marked as dead (due to memory swapping, which is another story):
The node with node id: 63049713526c2df2956d5f9296055ed9988cacd237c4c5d2aafe1a5a and ip: 192.168.11.21 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats
raylet.out of this dead node:
[2022-02-16 12:46:52,789 C 149 149] node_manager.cc:170: This node has beem marked as dead.
*** StackTrace Information ***
ray::SpdLogMessage::Flush()
ray::RayLog::~RayLog()
std::_Function_handler<>::_M_invoke()
std::_Function_handler<>::_M_invoke()
std::_Function_handler<>::_M_invoke()
ray::rpc::ClientCallImpl<>::OnReplyReceived()
std::_Function_handler<>::_M_invoke()
boost::asio::detail::completion_handler<>::do_complete()
boost::asio::detail::scheduler::do_run_one()
boost::asio::detail::scheduler::run()
boost::asio::io_context::run()
main
__libc_start_main
However, about an hour later (I am not sure how much time had elapsed since the node was marked as dead), ray.nodes() showed that the node was alive:
{'NodeID': 'fd09404717968d7458660c7e24a85106de826d0acfc45fdfcf91a9d3', 'Alive': True, 'NodeManagerAddress': '192.168.11.21', 'NodeManagerHostname': 'GPU-1121', 'NodeManagerPort': 40964, 'ObjectManagerPort': 35237, 'ObjectStoreSocketName': '/tmp/ray/session_2022-02-11_21-11-04_446422_248/sockets/plasma_store.1', 'RayletSocketName': '/tmp/ray/session_2022-02-11_21-11-04_446422_248/sockets/raylet.1', 'MetricsExportPort': 54534, 'alive': True, 'Resources': {'node:192.168.11.21': 1.0, 'GPU': 1.0, 'object_store_memory': 4294967296.0, 'accelerator_type:G': 1.0, 'memory': 27463581696.0, 'CPU': 12.0}}
But Ray dispatched all new remote tasks to this node, and all of these tasks failed with:
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
So I cannot start any new tasks because one node was marked as dead!
This is one possible way to crash the dashboard; see #18889. I am creating a new issue because I think this is a severe problem, and it is better tracked separately than inside a bug report about how the dashboard crashed.
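The node ID in the ray.nodes() entry above differs from the ID in the "marked dead" message, although both refer to 192.168.11.21. A minimal sketch for spotting such addresses follows. It assumes the input is a list of dicts shaped like ray.nodes() entries, and that the dead entry is still listed with 'Alive': False (an assumption; the report does not show it). The helper names are hypothetical and not part of Ray.

```python
from collections import defaultdict


def summarize_nodes_by_address(nodes):
    """Map each NodeManagerAddress to its list of (NodeID, Alive) pairs.

    `nodes` is a list of dicts shaped like the entries from ray.nodes().
    """
    by_address = defaultdict(list)
    for node in nodes:
        by_address[node["NodeManagerAddress"]].append(
            (node["NodeID"], node["Alive"])
        )
    return dict(by_address)


def addresses_with_multiple_ids(nodes):
    """Return only the addresses registered under more than one node ID."""
    summary = summarize_nodes_by_address(nodes)
    return {
        address: entries
        for address, entries in summary.items()
        if len({node_id for node_id, _ in entries}) > 1
    }


# Illustrative data: both node IDs and the IP are taken from this report,
# but the presence of the dead entry in ray.nodes() is an assumption.
sample_nodes = [
    {
        "NodeID": "63049713526c2df2956d5f9296055ed9988cacd237c4c5d2aafe1a5a",
        "Alive": False,
        "NodeManagerAddress": "192.168.11.21",
    },
    {
        "NodeID": "fd09404717968d7458660c7e24a85106de826d0acfc45fdfcf91a9d3",
        "Alive": True,
        "NodeManagerAddress": "192.168.11.21",
    },
]

suspect = addresses_with_multiple_ids(sample_nodes)
for address, entries in suspect.items():
    print(address, entries)

# On a live cluster (needs a running Ray cluster, so not executed here):
#   import ray
#   ray.init(address="auto")
#   print(addresses_with_multiple_ids(ray.nodes()))
```

An address that shows up with both a dead and an alive node ID is a reasonable first place to look when tasks scheduled there fail, but this check alone does not prove the node is unhealthy.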
Versions / Dependencies
ray 1.10.0
Reproduction script
Overload one worker node so that it stops responding to the head node, as described in the “What happened + What you expected to happen” section.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments: 8 (8 by maintainers)
Top GitHub Comments
Oh yes, the dashboard agent died a few seconds after the raylet was marked as dead.
Thanks for the explanation of how runtime env works. Yes, the process started by agent.py was dead at that time, if I remember correctly.
Closing this, but @newmanwang please reopen when you have a chance to address @rkooo567's question. Thanks