raylet crash "Exiting because this node manager has mistakenly been marked dead"
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS
- Ray installed from (source or binary): source
- Ray version: be2cbdf1306182007b904b5d976c0c318e6864e9
- Python version: 3.7.2
- Exact command to reproduce:
Describe the problem
Running ~8000 tasks on roughly 110 workers, we occasionally see raylet processes crash with the log below in raylet.err on the crashing node. The node in question has plenty of free memory at the time of the crash. Do you know what might be going wrong here?
Source code / logs
W0501 14:26:21.914784 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92044ms
W0501 14:26:21.914791 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -91944ms
W0501 14:26:21.914799 4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago
F0501 14:26:21.964082 4048 node_manager.cc:395] Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1556720781 (unix time) try "date -d @1556720781" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800000fd0) received by PID 4048 (TID 0x7f733dcd6740) from PID 4048; stack trace: ***
@ 0x7f733d8b0890 (unknown)
@ 0x7f733c9a4e97 gsignal
@ 0x7f733c9a6801 abort
@ 0x55d5dd8e9a59 google::logging_fail()
@ 0x55d5dd8ebafa google::LogMessage::Fail()
@ 0x55d5dd8ecf4f google::LogMessage::SendToLog()
@ 0x55d5dd8eb7ab google::LogMessage::Flush()
@ 0x55d5dd8eba01 google::LogMessage::~LogMessage()
@ 0x55d5dd8e8b44 ray::RayLog::~RayLog()
@ 0x55d5dd85f6ff ray::raylet::NodeManager::ClientRemoved()
@ 0x55d5dd8bddde ray::gcs::ClientTable::HandleNotification()
@ 0x55d5dd8c9720 _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_RKNS0_8UniqueIDESG_E_clES3_SJ_SG_EUlS3_SJ_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@ 0x55d5dd8ce57d _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDE24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaIS9_EEEZNS1_3LogIS4_15ClientTableDataE9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SD_EERKSL_IFvS3_EEEUlS3_S6_S7_SD_E_E9_M_invokeERKSt9_Any_dataOS3_S6_OS7_SD_
@ 0x55d5dd8cbbcb _ZZN3ray3gcs3LogINS_8ClientIDE15ClientTableDataE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14AsyncGcsClientES9_24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaISF_EEEERKSA_IFvSC_EEENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE_clESZ_
@ 0x55d5dd8d0a50 (anonymous namespace)::ProcessCallback()
@ 0x55d5dd8d206d ray::gcs::SubscribeRedisCallback()
@ 0x55d5dd8d59fc redisProcessCallbacks
@ 0x55d5dd8d482c RedisAsioClient::handle_read()
@ 0x55d5dd8d2fda boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@ 0x55d5dd81c7e9 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@ 0x55d5dd81c4c9 boost::asio::detail::scheduler::run()
@ 0x55d5dd80c2dc main
@ 0x7f733c987b97 __libc_start_main
@ 0x55d5dd8141da _start
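A note on the log above: the warnings show heartbeats and task-lease renewals lagging by well over two minutes, the monitor then marks the node dead, and the raylet deliberately aborts when the dead-node notification names its own client ID. If the cause of the stall cannot be found, one common mitigation (it hides, rather than fixes, whatever blocked the heartbeats) is to raise the heartbeat timeout. A minimal sketch, assuming the num_heartbeats_timeout config key and the _internal_config argument available around this Ray version (both are assumptions; newer releases pass this via _system_config instead):

```python
import json
import ray

# Sketch only: the exact key name and how it is passed differ across Ray
# versions. num_heartbeats_timeout is the number of missed heartbeat periods
# (100 ms each by default) the monitor tolerates before marking a raylet dead,
# so 300 gives roughly a 30 s grace period instead of the default ~10 s.
ray.init(_internal_config=json.dumps({"num_heartbeats_timeout": 300}))
```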
Top GitHub Comments
@markgoodhead Sorry for the delayed reply.
I'm currently working on integrating the profiling views into the dashboard (webui). Once that is done, the metrics can be viewed through the webui. If you want to view them right now, you need to start a Prometheus server as described in https://github.com/ray-project/ray/pull/4246.
@robertnishihara We've seen it in a number of different workloads, but the most common one we've run (and seen it on) is with Ray Tune: we have roughly 3-5 concurrent hyperparameter samples, and each sample spins off 3000 embarrassingly parallel ray remote tasks (there are no dependencies between the 3000 tasks) and then waits for all of them to finish. So although many thousands of ray remote functions are running simultaneously, only the 3-5 Ray Tune trainable function calls have dependencies on other tasks (their 3000 sub-tasks).
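For reference, a minimal sketch of the fan-out pattern described above. The names (subtask, trainable) and the tune.report call are illustrative, and the exact Tune API differs between Ray versions:

```python
import ray
from ray import tune

@ray.remote
def subtask(config, i):
    # Embarrassingly parallel unit of work; no dependency on any other task.
    return i

def trainable(config):
    # Each Tune trial fans out ~3000 independent remote tasks and blocks until
    # all of them finish, so only this call depends on other tasks.
    results = ray.get([subtask.remote(config, i) for i in range(3000)])
    tune.report(total=sum(results))

ray.init(address="auto")  # connect to the running cluster
tune.run(trainable, num_samples=5, config={"lr": tune.uniform(1e-3, 1e-1)})
```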