raylet crash "Exiting because this node manager has mistakenly been marked dead"
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS
- Ray installed from (source or binary): source
- Ray version: be2cbdf1306182007b904b5d976c0c318e6864e9
- Python version: 3.7.2
- Exact command to reproduce:
Describe the problem
Running ~8000 tasks on roughly 110 workers, we occasionally see raylet processes crash with the log below in raylet.err on the crashing node. The node in question has plenty of free memory at the time of the crash. Do you know what might be going wrong here?
Source code / logs
W0501 14:26:21.914784 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92044ms
W0501 14:26:21.914791 4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -91944ms
W0501 14:26:21.914799 4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago
F0501 14:26:21.964082 4048 node_manager.cc:395] Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1556720781 (unix time) try "date -d @1556720781" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800000fd0) received by PID 4048 (TID 0x7f733dcd6740) from PID 4048; stack trace: ***
@ 0x7f733d8b0890 (unknown)
@ 0x7f733c9a4e97 gsignal
@ 0x7f733c9a6801 abort
@ 0x55d5dd8e9a59 google::logging_fail()
@ 0x55d5dd8ebafa google::LogMessage::Fail()
@ 0x55d5dd8ecf4f google::LogMessage::SendToLog()
@ 0x55d5dd8eb7ab google::LogMessage::Flush()
@ 0x55d5dd8eba01 google::LogMessage::~LogMessage()
@ 0x55d5dd8e8b44 ray::RayLog::~RayLog()
@ 0x55d5dd85f6ff ray::raylet::NodeManager::ClientRemoved()
@ 0x55d5dd8bddde ray::gcs::ClientTable::HandleNotification()
@ 0x55d5dd8c9720 _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_RKNS0_8UniqueIDESG_E_clES3_SJ_SG_EUlS3_SJ_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
@ 0x55d5dd8ce57d _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDE24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaIS9_EEEZNS1_3LogIS4_15ClientTableDataE9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SD_EERKSL_IFvS3_EEEUlS3_S6_S7_SD_E_E9_M_invokeERKSt9_Any_dataOS3_S6_OS7_SD_
@ 0x55d5dd8cbbcb _ZZN3ray3gcs3LogINS_8ClientIDE15ClientTableDataE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14AsyncGcsClientES9_24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaISF_EEEERKSA_IFvSC_EEENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE_clESZ_
@ 0x55d5dd8d0a50 (anonymous namespace)::ProcessCallback()
@ 0x55d5dd8d206d ray::gcs::SubscribeRedisCallback()
@ 0x55d5dd8d59fc redisProcessCallbacks
@ 0x55d5dd8d482c RedisAsioClient::handle_read()
@ 0x55d5dd8d2fda boost::asio::detail::reactive_null_buffers_op<>::do_complete()
@ 0x55d5dd81c7e9 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
@ 0x55d5dd81c4c9 boost::asio::detail::scheduler::run()
@ 0x55d5dd80c2dc main
@ 0x7f733c987b97 __libc_start_main
@ 0x55d5dd8141da _start
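A note on the log above: the warnings show heartbeats and task-lease renewals lagging by well over two minutes, the monitor then marks the node dead, and the raylet deliberately aborts when the dead-node notification names its own client ID. If the cause of the stall cannot be found, one common mitigation (it hides, rather than fixes, whatever blocked the heartbeats) is to raise the heartbeat timeout. A minimal sketch, assuming the num_heartbeats_timeout config key and the _internal_config argument available around this Ray version (both are assumptions; newer releases pass this via _system_config instead):

```python
import json
import ray

# Sketch only: the exact key name and how it is passed differ across Ray
# versions. num_heartbeats_timeout is the number of missed heartbeat periods
# (100 ms each by default) the monitor tolerates before marking a raylet dead,
# so 300 gives roughly a 30 s grace period instead of the default ~10 s.
ray.init(_internal_config=json.dumps({"num_heartbeats_timeout": 300}))
```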
Top GitHub Comments
@markgoodhead Sorry for the delayed reply.
I'm currently working on integrating the profiling views into the dashboard (webui). Once that is done, the metrics can be viewed through the webui. If you want to view them right now, you need to start a Prometheus server as described in https://github.com/ray-project/ray/pull/4246.
@robertnishihara We've seen it in a number of different workloads, but the most common one we've run (and seen it on) is with Ray Tune: we have roughly 3-5 concurrent hyperparameter samples, and each sample spins off 3000 embarrassingly parallel ray remote tasks (there are no dependencies between the 3000 tasks) and then waits for all of them to finish. So although many thousands of ray remote functions are running simultaneously, only the 3-5 Ray Tune trainable function calls have dependencies on other tasks (their 3000 sub-tasks).
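For reference, a minimal sketch of the fan-out pattern described above. The names (subtask, trainable) and the tune.report call are illustrative, and the exact Tune API differs between Ray versions:

```python
import ray
from ray import tune

@ray.remote
def subtask(config, i):
    # Embarrassingly parallel unit of work; no dependency on any other task.
    return i

def trainable(config):
    # Each Tune trial fans out ~3000 independent remote tasks and blocks until
    # all of them finish, so only this call depends on other tasks.
    results = ray.get([subtask.remote(config, i) for i in range(3000)])
    tune.report(total=sum(results))

ray.init(address="auto")  # connect to the running cluster
tune.run(trainable, num_samples=5, config={"lr": tune.uniform(1e-3, 1e-1)})
```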