Raylet crashes when an actor creation task is resubmitted for a dead actor

See original GitHub issue

Describe the problem

After an actor dies (intentionally or due to node failure), we correctly store an exception as the return value for any requested object that should have been created by a method of the actor. However, if you request an object that should have been created by the actor creation task, or the __init__ method, then the Raylet crashes. We should fix this so that we store an exception in the return value.
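
For contrast, the following is a minimal sketch of the case that already behaves as intended (the actor and method names are made up for illustration and are not from the issue): fetching the result of a regular method that a dead actor can never run raises an error on the caller's side instead of crashing the raylet.

import os
import ray

ray.init()

@ray.remote
class Worker(object):
    def die(self):
        # Simulate an unexpected actor death by killing the worker process.
        os._exit(0)

    def ping(self):
        return "pong"

w = Worker.remote()
w.die.remote()

try:
    # The ping task can never execute, so an exception is stored as its return
    # value and re-raised here; the raylet stays up.
    ray.get(w.ping.remote())
except Exception as e:
    print("Got the expected error:", type(e).__name__)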

Source code / logs

The __init__ task gets resubmitted correctly, but then a Raylet tries to execute it. After the task completes, one of the assertions that checks actor state transitions fails.

I0118 17:23:18.909910 14983 node_manager.cc:1779] Resubmitting task 00000000bb941461cef5a715c627e2a88e712171 on client 089d0297f163306e47422f44dfde56fb26d5e3e0
F0118 17:23:18.914819 14983 node_manager.cc:1661]  Check failed: actor_entry->second.GetState() == ActorState::RECONSTRUCTING
*** Check failure stack trace: ***
*** Aborted at 1547860998 (unix time) try "date -d @1547860998" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e800003a87) received by PID 14983 (TID 0x7f4d7e2b8740) from PID 14983; stack trace: ***
    @     0x7f4d7deb5390 (unknown)
    @     0x7f4d7d26e428 gsignal
    @     0x7f4d7d27002a abort
    @           0x5da116 google::logging_fail()
    @           0x5da140 google::LogMessage::Fail()
    @           0x5da084 google::LogMessage::SendToLog()
    @           0x5d99c6 google::LogMessage::Flush()
    @           0x5d97c1 google::LogMessage::~LogMessage()
    @           0x509700 ray::RayLog::~RayLog()
    @           0x554019 ray::raylet::NodeManager::FinishAssignedActorTask()
    @           0x554688 ray::raylet::NodeManager::FinishAssignedTask()
    @           0x55482f ray::raylet::NodeManager::ProcessGetTaskMessage()
    @           0x555593 ray::raylet::NodeManager::ProcessClientMessage()
    @           0x4bfbac _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionIN5boost4asio5local15stream_protocolEEEElPKhEZNS1_6raylet6Raylet12HandleAcceptERKNS3_6system10error_codeEEUlS8_lSA_E0_E9_M_invokeERKSt9_Any_dataOS8_OlOSA_
    @           0x514823 ray::ClientConnection<>::ProcessMessage()
    @           0x5109ac boost::asio::detail::read_op<>::operator()()
    @           0x510b75 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @           0x4bb5a2 boost::asio::detail::scheduler::run()
    @           0x4aef56 main
    @     0x7f4d7d259830 __libc_start_main
    @           0x4b45b9 _start
    @                0x0 (unknown)
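
To make the failing check concrete, here is a rough Python sketch of the invariant as inferred from the assertion message above (an assumption, not a transcription of the C++ in node_manager.cc): FinishAssignedActorTask expects a re-executed actor creation task only while the actor is RECONSTRUCTING, so an entry already marked DEAD trips the check.

from enum import Enum

class ActorState(Enum):
    ALIVE = 1
    RECONSTRUCTING = 2
    DEAD = 3

def finish_assigned_actor_creation_task(actor_registry, actor_id):
    entry = actor_registry.get(actor_id)
    if entry is None:
        # First execution of the creation task: register the actor as ALIVE.
        actor_registry[actor_id] = ActorState.ALIVE
        return
    # A resubmitted creation task is only expected for an actor that is being
    # reconstructed; a DEAD entry (e.g. after an intentional exit) fails here.
    assert entry == ActorState.RECONSTRUCTING, \
        "Check failed: actor_entry->second.GetState() == ActorState::RECONSTRUCTING"
    actor_registry[actor_id] = ActorState.ALIVE

# The crash scenario: the actor's entry is already DEAD when the resubmitted
# __init__ task finishes.
registry = {"actor-1": ActorState.DEAD}
try:
    finish_assigned_actor_creation_task(registry, "actor-1")
except AssertionError as err:
    print(err)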

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Jan 23, 2019

Ah sorry, here’s a script to reproduce the error. The client triggers the error by requesting the object returned by the actor creation task.

import ray

# Force the error to appear sooner.
ray.init(_internal_config='{"initial_reconstruction_timeout_milliseconds": 200}')

@ray.remote
class Actor(object):
    def __init__(self):
        return

a = Actor.remote()
a.__ray_terminate__.remote()  # Kill the actor.
# Requesting the object returned by the actor creation task crashes the raylet.
ray.get(a._ray_actor_creation_dummy_object_id)

0 reactions
stephanie-wang commented, Jan 24, 2019

Ah yeah, I agree, and normally that line would just hang if the actor was still alive. I think we should support this case, though, since it is conceivable that someone could call ray.get on an object that was ray.put by the actor creation task.
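
As a rough illustration of that scenario (a sketch with made-up names, not code from the issue; the exact behaviour after the actor exits depends on the Ray version), the actor creation task itself can ray.put an object whose handle a client holds and later tries to fetch once the actor is gone:

import ray

ray.init()

@ray.remote
class Actor(object):
    def __init__(self):
        # The actor creation task puts an object into the object store.
        self.init_ref = ray.put("created during __init__")

    def get_init_ref(self):
        return self.init_ref

a = Actor.remote()
ref = ray.get(a.get_init_ref.remote())  # handle to the object put by __init__
a.__ray_terminate__.remote()

try:
    # Fetching an object produced by the now-dead actor creation task should
    # yield the value or a stored exception, never a raylet crash.
    print(ray.get(ref))
except Exception as e:
    print("Object from the creation task is unavailable:", type(e).__name__)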

The reason we’re thinking about this issue right now is so that we can implement the signal API (#3624). That API allows an actor to send signals to another Ray task/actor about its current status. Our current proposal for implementing this is to use the return value object IDs for the actor creation task to store the signal data, although we’re open to other suggestions on how to do this.

