Raylet crashes when an actor creation task is resubmitted for a dead actor

See original GitHub issue

Describe the problem

After an actor dies (intentionally or due to node failure), we correctly store an exception as the return value for any requested object that should have been created by a method of the actor. However, if you request an object that should have been created by the actor creation task, or the __init__ method, then the Raylet crashes. We should fix this so that we store an exception in the return value.
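
For contrast, the following is a minimal sketch of the case that already behaves as intended (the actor and method names are made up for illustration and are not from the issue): fetching the result of a regular method that a dead actor can never run raises an error on the caller's side instead of crashing the raylet.

import os
import ray

ray.init()

@ray.remote
class Worker(object):
    def die(self):
        # Simulate an unexpected actor death by killing the worker process.
        os._exit(0)

    def ping(self):
        return "pong"

w = Worker.remote()
w.die.remote()

try:
    # The ping task can never execute, so an exception is stored as its return
    # value and re-raised here; the raylet stays up.
    ray.get(w.ping.remote())
except Exception as e:
    print("Got the expected error:", type(e).__name__)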

Source code / logs

The __init__ task gets resubmitted correctly, but then a Raylet tries to execute it. After the task completes, one of the assertions that checks actor state transitions fails.

I0118 17:23:18.909910 14983 node_manager.cc:1779] Resubmitting task 00000000bb941461cef5a715c627e2a88e712171 on client 089d0297f163306e47422f44dfde56fb26d5e3e0
F0118 17:23:18.914819 14983 node_manager.cc:1661]  Check failed: actor_entry->second.GetState() == ActorState::RECONSTRUCTING
*** Check failure stack trace: ***
*** Aborted at 1547860998 (unix time) try "date -d @1547860998" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3e800003a87) received by PID 14983 (TID 0x7f4d7e2b8740) from PID 14983; stack trace: ***
    @     0x7f4d7deb5390 (unknown)
    @     0x7f4d7d26e428 gsignal
    @     0x7f4d7d27002a abort
    @           0x5da116 google::logging_fail()
    @           0x5da140 google::LogMessage::Fail()
    @           0x5da084 google::LogMessage::SendToLog()
    @           0x5d99c6 google::LogMessage::Flush()
    @           0x5d97c1 google::LogMessage::~LogMessage()
    @           0x509700 ray::RayLog::~RayLog()
    @           0x554019 ray::raylet::NodeManager::FinishAssignedActorTask()
    @           0x554688 ray::raylet::NodeManager::FinishAssignedTask()
    @           0x55482f ray::raylet::NodeManager::ProcessGetTaskMessage()
    @           0x555593 ray::raylet::NodeManager::ProcessClientMessage()
    @           0x4bfbac _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionIN5boost4asio5local15stream_protocolEEEElPKhEZNS1_6raylet6Raylet12HandleAcceptERKNS3_6system10error_codeEEUlS8_lSA_E0_E9_M_invokeERKSt9_Any_dataOS8_OlOSA_
    @           0x514823 ray::ClientConnection<>::ProcessMessage()
    @           0x5109ac boost::asio::detail::read_op<>::operator()()
    @           0x510b75 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    @           0x4bb5a2 boost::asio::detail::scheduler::run()
    @           0x4aef56 main
    @     0x7f4d7d259830 __libc_start_main
    @           0x4b45b9 _start
    @                0x0 (unknown)
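
To make the failing check concrete, here is a rough Python sketch of the invariant as inferred from the assertion message above (an assumption, not a transcription of the C++ in node_manager.cc): FinishAssignedActorTask expects a re-executed actor creation task only while the actor is RECONSTRUCTING, so an entry already marked DEAD trips the check.

from enum import Enum

class ActorState(Enum):
    ALIVE = 1
    RECONSTRUCTING = 2
    DEAD = 3

def finish_assigned_actor_creation_task(actor_registry, actor_id):
    entry = actor_registry.get(actor_id)
    if entry is None:
        # First execution of the creation task: register the actor as ALIVE.
        actor_registry[actor_id] = ActorState.ALIVE
        return
    # A resubmitted creation task is only expected for an actor that is being
    # reconstructed; a DEAD entry (e.g. after an intentional exit) fails here.
    assert entry == ActorState.RECONSTRUCTING, \
        "Check failed: actor_entry->second.GetState() == ActorState::RECONSTRUCTING"
    actor_registry[actor_id] = ActorState.ALIVE

# The crash scenario: the actor's entry is already DEAD when the resubmitted
# __init__ task finishes.
registry = {"actor-1": ActorState.DEAD}
try:
    finish_assigned_actor_creation_task(registry, "actor-1")
except AssertionError as err:
    print(err)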

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Jan 23, 2019

Ah sorry, here’s a script to reproduce the error. The client triggers the error by requesting the object returned by the actor creation task.

import ray

# Force the error to appear sooner.
ray.init(_internal_config='{"initial_reconstruction_timeout_milliseconds": 200}')

@ray.remote
class Actor(object):
    def __init__(self):
        return

a = Actor.remote()
a.__ray_terminate__.remote()  # Kill the actor.
# Requesting the object returned by the actor creation task crashes the raylet.
ray.get(a._ray_actor_creation_dummy_object_id)

0 reactions
stephanie-wang commented, Jan 24, 2019

Ah yeah, I agree, and normally that line would just hang if the actor was still alive. I think we should support this case, though, since it is conceivable that someone could call ray.get on an object that was ray.put by the actor creation task.
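
As a rough illustration of that scenario (a sketch with made-up names, not code from the issue; the exact behaviour after the actor exits depends on the Ray version), the actor creation task itself can ray.put an object whose handle a client holds and later tries to fetch once the actor is gone:

import ray

ray.init()

@ray.remote
class Actor(object):
    def __init__(self):
        # The actor creation task puts an object into the object store.
        self.init_ref = ray.put("created during __init__")

    def get_init_ref(self):
        return self.init_ref

a = Actor.remote()
ref = ray.get(a.get_init_ref.remote())  # handle to the object put by __init__
a.__ray_terminate__.remote()

try:
    # Fetching an object produced by the now-dead actor creation task should
    # yield the value or a stored exception, never a raylet crash.
    print(ray.get(ref))
except Exception as e:
    print("Object from the creation task is unavailable:", type(e).__name__)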

The reason we’re thinking about this issue right now is so that we can implement the signal API (#3624). That API allows an actor to send signals to another Ray task/actor about its current status. Our current proposal for implementing this is to use the return value object IDs for the actor creation task to store the signal data, although we’re open to other suggestions on how to do this.

