Raylet crashes when an actor creation task is resubmitted for a dead actor
Describe the problem
After an actor dies (intentionally or due to node failure), we correctly store an exception as the return value for any requested object that should have been created by a method of the actor. However, if you request an object that should have been created by the actor creation task, i.e. the __init__ method, then the Raylet crashes. We should fix this so that we store an exception in the return value instead.
Source code / logs
The __init__ task gets resubmitted correctly, but then a Raylet tries to execute it. After the task completes, one of the assertions that checks actor state transitions fails.
I0118 17:23:18.909910 14983 node_manager.cc:1779] Resubmitting task 00000000bb941461cef5a715c627e2a88e712171 on client 089d0297f163306e47422f44dfde56fb26d5e3e0
F0118 17:23:18.914819 14983 node_manager.cc:1661] Check failed: actor_entry->second.GetState() == ActorState::RECONSTRUCTING
*** Check failure stack trace: ***
*** Aborted at 1547860998 (unix time) try "date -d @1547860998" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800003a87) received by PID 14983 (TID 0x7f4d7e2b8740) from PID 14983; stack trace: ***
@ 0x7f4d7deb5390 (unknown)
@ 0x7f4d7d26e428 gsignal
@ 0x7f4d7d27002a abort
@ 0x5da116 google::logging_fail()
@ 0x5da140 google::LogMessage::Fail()
@ 0x5da084 google::LogMessage::SendToLog()
@ 0x5d99c6 google::LogMessage::Flush()
@ 0x5d97c1 google::LogMessage::~LogMessage()
@ 0x509700 ray::RayLog::~RayLog()
@ 0x554019 ray::raylet::NodeManager::FinishAssignedActorTask()
@ 0x554688 ray::raylet::NodeManager::FinishAssignedTask()
@ 0x55482f ray::raylet::NodeManager::ProcessGetTaskMessage()
@ 0x555593 ray::raylet::NodeManager::ProcessClientMessage()
@ 0x4bfbac _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionIN5boost4asio5local15stream_protocolEEEElPKhEZNS1_6raylet6Raylet12HandleAcceptERKNS3_6system10error_codeEEUlS8_lSA_E0_E9_M_invokeERKSt9_Any_dataOS8_OlOSA_
@ 0x514823 ray::ClientConnection<>::ProcessMessage()
@ 0x5109ac boost::asio::detail::read_op<>::operator()()
@ 0x510b75 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
@ 0x4bb5a2 boost::asio::detail::scheduler::run()
@ 0x4aef56 main
@ 0x7f4d7d259830 __libc_start_main
@ 0x4b45b9 _start
@ 0x0 (unknown)
Issue Analytics
- Created 5 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Ah sorry, here’s a script to reproduce the error. The client triggers the error by requesting the object returned by the actor creation task.
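The reproduction script itself was not captured in this archive. Below is a minimal sketch of the scenario described, assuming the actor's worker process is killed with SIGKILL to simulate actor death; the _ray_actor_creation_dummy_object_id attribute used to obtain the actor creation task's return object ID is an internal detail and only an assumption here, since the public API does not expose that ID.

# Minimal sketch, not the original reproduction script.
import os
import signal

import ray

ray.init()

@ray.remote
class Actor:
    def __init__(self):
        self.value = 0

    def pid(self):
        return os.getpid()

actor = Actor.remote()
pid = ray.get(actor.pid.remote())

# Simulate an unexpected actor death by killing the actor's worker process.
os.kill(pid, signal.SIGKILL)

# Getting an object returned by a *method* of the dead actor correctly raises
# an exception. Getting the object returned by the actor creation task
# (__init__) is the case that crashes the raylet.
creation_return_id = actor._ray_actor_creation_dummy_object_id  # assumed internal field
ray.get(creation_return_id)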
Ah yeah, I agree, and normally that line would just hang if the actor was still alive. I think we should support this case, though, since it is conceivable that someone could call ray.get on an object that was ray.put by the actor creation task.

The reason we're thinking about this issue right now is so that we can implement the signal API (#3624). That API allows an actor to send signals to another Ray task/actor about its current status. Our current proposal for implementing this is to use the return value object IDs for the actor creation task to store the signal data, although we're open to other suggestions on how to do this.