raylet crash "Exiting because this node manager has mistakenly been marked dead"


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2 LTS
  • Ray installed from (source or binary): source
  • Ray version: be2cbdf1306182007b904b5d976c0c318e6864e9
  • Python version: 3.7.2
  • Exact command to reproduce:

Describe the problem

While running roughly 8,000 tasks on about 110 workers, we occasionally see raylet processes crash with the log below in raylet.err on the crashing node. The node in question has plenty of free memory at the time of the crash. Do you know what might be going wrong here?

Source code / logs

W0501 14:26:21.914784  4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92044ms
W0501 14:26:21.914791  4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -91944ms
W0501 14:26:21.914799  4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago 
F0501 14:26:21.964082  4048 node_manager.cc:395]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Check failure stack trace: ***
*** Aborted at 1556720781 (unix time) try "date -d @1556720781" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e800000fd0) received by PID 4048 (TID 0x7f733dcd6740) from PID 4048; stack trace: ***
    @     0x7f733d8b0890 (unknown)
    @     0x7f733c9a4e97 gsignal
    @     0x7f733c9a6801 abort
    @     0x55d5dd8e9a59 google::logging_fail()
    @     0x55d5dd8ebafa google::LogMessage::Fail()
    @     0x55d5dd8ecf4f google::LogMessage::SendToLog()
    @     0x55d5dd8eb7ab google::LogMessage::Flush()
    @     0x55d5dd8eba01 google::LogMessage::~LogMessage()
    @     0x55d5dd8e8b44 ray::RayLog::~RayLog()
    @     0x55d5dd85f6ff ray::raylet::NodeManager::ClientRemoved()
    @     0x55d5dd8bddde ray::gcs::ClientTable::HandleNotification()
    @     0x55d5dd8c9720 _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDERKSt6vectorI16ClientTableDataTSaIS8_EEEZZNS1_11ClientTable7ConnectERKS8_ENKUlS3_RKNS0_8UniqueIDESG_E_clES3_SJ_SG_EUlS3_SJ_SC_E_E9_M_invokeERKSt9_Any_dataOS3_S6_SC_
    @     0x55d5dd8ce57d _ZNSt17_Function_handlerIFvPN3ray3gcs14AsyncGcsClientERKNS0_8ClientIDE24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaIS9_EEEZNS1_3LogIS4_15ClientTableDataE9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SD_EERKSL_IFvS3_EEEUlS3_S6_S7_SD_E_E9_M_invokeERKSt9_Any_dataOS3_S6_OS7_SD_
    @     0x55d5dd8cbbcb _ZZN3ray3gcs3LogINS_8ClientIDE15ClientTableDataE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14AsyncGcsClientES9_24GcsTableNotificationModeRKSt6vectorI16ClientTableDataTSaISF_EEEERKSA_IFvSC_EEENKUlRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEE_clESZ_
    @     0x55d5dd8d0a50 (anonymous namespace)::ProcessCallback()
    @     0x55d5dd8d206d ray::gcs::SubscribeRedisCallback()
    @     0x55d5dd8d59fc redisProcessCallbacks
    @     0x55d5dd8d482c RedisAsioClient::handle_read()
    @     0x55d5dd8d2fda boost::asio::detail::reactive_null_buffers_op<>::do_complete()
    @     0x55d5dd81c7e9 boost::asio::detail::epoll_reactor::descriptor_state::do_complete()
    @     0x55d5dd81c4c9 boost::asio::detail::scheduler::run()
    @     0x55d5dd80c2dc main
    @     0x7f733c987b97 __libc_start_main
    @     0x55d5dd8141da _start
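
When triaging a crash like this, it can help to pull the heartbeat gap and the abort time out of raylet.err programmatically. The following is a minimal sketch using only the Python standard library. The sample lines are copied from the excerpt above, while the function names and regular expressions are illustrative assumptions and not part of Ray.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the raylet.err excerpt above.
LOG = """\
W0501 14:26:21.914784  4048 task_dependency_manager.cc:258] Task lease to renew has already expired by -92044ms
W0501 14:26:21.914799  4048 node_manager.cc:244] Last heartbeat was sent 130991 ms ago
F0501 14:26:21.964082  4048 node_manager.cc:395]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
*** Aborted at 1556720781 (unix time) try "date -d @1556720781" if you are using GNU date ***
"""

# Hypothetical patterns, written to match the message formats seen in this log only.
HEARTBEAT_RE = re.compile(r"Last heartbeat was sent (\d+) ms ago")
ABORT_RE = re.compile(r"Aborted at (\d+) \(unix time\)")


def heartbeat_gaps_ms(text):
    """Return every reported heartbeat gap, in milliseconds."""
    return [int(m.group(1)) for m in HEARTBEAT_RE.finditer(text)]


def abort_time_utc(text):
    """Return the abort time as a timezone-aware UTC datetime, or None if absent."""
    m = ABORT_RE.search(text)
    if m is None:
        return None
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)


gaps = heartbeat_gaps_ms(LOG)
aborted = abort_time_utc(LOG)
print(gaps)                 # [130991]
print(aborted.isoformat())  # 2019-05-01T14:26:21+00:00
```

The reported gap of 130991 ms means the node manager had not sent a heartbeat for roughly 131 seconds before it was marked dead. The abort time decodes to 14:26:21 UTC, the same wall-clock time as the fatal log line.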

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 25 (15 by maintainers)

Top GitHub Comments

jovany-wang commented, May 6, 2019 (2 reactions)

@markgoodhead Sorry for the delayed reply.

I’m currently working on integrating the profiling views into the dashboard (web UI). Once that is done, the metrics will be viewable through the web UI.

If you want to view them right now, you need to start the Prometheus server as described in https://github.com/ray-project/ray/pull/4246 .

markgoodhead commented, May 4, 2019 (1 reaction)

@robertnishihara We’ve seen it in a number of different workloads, but the most common one is with Ray Tune, where roughly 3-5 hyperparameter samples run concurrently. Each sample spins off 3000 Ray remote tasks (embarrassingly parallel, with no dependencies among the 3k tasks) and then waits for all 3000 to finish. So although many thousands of Ray remote functions are running at one time, only the 3-5 Ray Tune trainable function calls depend on any other tasks (their 3k subtasks).
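
The shape of that workload can be sketched as follows. This is not Ray code: it uses `concurrent.futures` from the Python standard library as a stand-in for Ray remote tasks and Ray Tune trials, so that the dependency structure is visible without a cluster. The names `subtask`, `run_sample`, and `run_trials` and the worker counts are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def subtask(i):
    # Placeholder for one independent unit of work (a Ray remote task in the issue).
    return i * i


def run_sample(num_subtasks):
    # One hyperparameter sample: fan out independent subtasks,
    # then block until every one of them has finished.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return sum(pool.map(subtask, range(num_subtasks)))


def run_trials(num_trials, num_subtasks):
    # A few samples run concurrently; only these outer calls depend on other tasks.
    with ThreadPoolExecutor(max_workers=num_trials) as pool:
        return list(pool.map(run_sample, [num_subtasks] * num_trials))


results = run_trials(num_trials=3, num_subtasks=3000)
print(len(results))
```

The only waits in this structure are the outer calls blocking on their own subtasks. That matches the description above of thousands of independent tasks with just a handful of dependent callers.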
