
Cannot recreate a child actor

See original GitHub issue

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.9dev

The problem occurs for tasks/actors that are automatically restarted by Ray after a crash. When the task or actor creates a child actor and then crashes, I expect the child actor to be recreated when the parent task/actor is restarted. Instead, any tasks submitted to the child actor fail with a RayActorError.

Reproduction (REQUIRED)

I tried two different versions of the script. One uses the default @ray.remote decorator for the child actor; the other decorates the child with @ray.remote(max_restarts=-1, max_task_retries=-1) (that variant is shown just before its output below).

import ray
import time
import os

# Child actor; this first variant uses the default fault-tolerance settings.
@ray.remote
class Actor:
    def __init__(self):
        pass
    def ready(self):
        return

# The parent is restarted indefinitely after a crash and its tasks are retried.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Parent:
    def __init__(self):
        self.child = Actor.remote()
    def ready(self):
        return ray.get(self.child.ready.remote())
    def pid(self):
        return os.getpid()


ray.init()
p = Parent.remote()
pid = ray.get(p.pid.remote())
# Kill the parent actor's worker process; Ray restarts the parent because
# max_restarts=-1, but the child actor is not recreated.
os.kill(pid, 9)
# The restarted parent's call into its child fails with RayActorError.
print("ready", ray.get(p.ready.remote()))

Output:

2020-09-01 14:04:57,259 INFO resource_spec.py:250 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.18 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 14:04:57,732 INFO services.py:1201 -- View the Ray dashboard at 127.0.0.1:8265
E0901 14:04:59.840095  2653  2892 task_manager.cc:304] infinite retries left for task 7bbd90284b71e599df5a1a8201000000, attempting to resubmit.
E0901 14:04:59.840198  2653  2892 core_worker.cc:410] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Parent, function_name=ready, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2020-09-01 14:04:59,850 WARNING worker.py:1067 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
Traceback (most recent call last):
  File "test_actor.py", line 26, in <module>
    print("ready", ray.get(p.ready.remote()))
  File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1423, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::Parent.ready() (pid=2743, ip=192.168.1.46)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
    with ray.worker._changeproctitle(title, next_title):
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
    return function(actor, *arguments, **kwarguments)
  File "test_actor.py", line 17, in ready
    return ray.get(self.child.ready.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=2743) E0901 14:05:05.768839  2743  2827 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Actor, function_name=ready, function_hash=}, task_id=150a9d56b40e3700bdff035801000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=bdff035801000000, actor_caller_id=ffffffffffffffffdf5a1a8201000000, actor_counter=0}
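
For reference, the second variant mentioned in the description differs only in the child actor's decorator; the rest of the script is unchanged:

@ray.remote(max_restarts=-1, max_task_retries=-1)
class Actor:
    def __init__(self):
        pass
    def ready(self):
        return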

Output with @ray.remote(max_restarts=-1, max_task_retries=-1) for the child actor:

2020-09-01 14:05:20,903 INFO resource_spec.py:250 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.17 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 14:05:21,350 INFO services.py:1201 -- View the Ray dashboard at 127.0.0.1:8265
E0901 14:05:23.458299  2936  3212 task_manager.cc:304] infinite retries left for task 7bbd90284b71e599df5a1a8201000000, attempting to resubmit.
E0901 14:05:23.458393  2936  3212 core_worker.cc:410] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Parent, function_name=ready, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2020-09-01 14:05:23,466 WARNING worker.py:1067 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
(pid=3047) 2020-09-01 14:05:23,535      ERROR worker.py:372 -- SystemExit was raised from the worker
(pid=3047) Traceback (most recent call last):
(pid=3047)   File "python/ray/_raylet.pyx", line 549, in ray._raylet.task_execution_handler
(pid=3047)     execute_task(task_type, ray_function, c_resources, c_args,
(pid=3047)   File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task
(pid=3047)     with core_worker.profile_event(b"task", extra_data=extra_data):
(pid=3047)   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=3047)     with core_worker.profile_event(b"task:execute"):
(pid=3047)   File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
(pid=3047)     switch_worker_log_if_needed(worker, job_id)
(pid=3047)   File "python/ray/_raylet.pyx", line 339, in ray._raylet.switch_worker_log_if_needed
(pid=3047)     ray.worker.set_log_file(job_stdout_path, job_stderr_path)
(pid=3047)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 903, in set_log_file
(pid=3047)     _set_log_file(stderr_name, worker_pid, sys.stderr, stderr_setter)
(pid=3047)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 869, in _set_log_file
(pid=3047)     setter_func(open_log(fileno, unbuffered=True, closefd=False))
(pid=3047)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 433, in open_log
(pid=3047)     stream = open(path, **kwargs)
(pid=3047)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/codecs.py", line 186, in __init__
(pid=3047)     def __init__(self, errors='strict'):
(pid=3047)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 369, in sigterm_handler
(pid=3047)     sys.exit(1)
(pid=3047) SystemExit: 1
(pid=raylet) F0901 14:05:23.537029  2971  2971 node_manager.cc:2557]  Check failed: local_queues_.RemoveTask(task_id, &task) ffffffffffffffffbdff035801000000
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @     0x5623d4527b3d  google::LogMessage::Fail()
(pid=raylet)     @     0x5623d4528c9c  google::LogMessage::SendToLog()
(pid=raylet)     @     0x5623d4527819  google::LogMessage::Flush()
(pid=raylet)     @     0x5623d4527a31  google::LogMessage::~LogMessage()
(pid=raylet)     @     0x5623d44e0139  ray::RayLog::~RayLog()
(pid=raylet)     @     0x5623d41f2ccf  ray::raylet::NodeManager::FinishAssignedTask()
(pid=raylet)     @     0x5623d4206544  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @     0x5623d420662f  ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet)     @     0x5623d420b37b  ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet)     @     0x5623d4181ba0  _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZNS1_6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEEUlS3_lS8_E0_E9_M_invokeERKSt9_Any_dataS3_lS8_
(pid=raylet)     @     0x5623d44c64ce  ray::ClientConnection::ProcessMessage()
(pid=raylet)     @     0x5623d44c33ba  boost::asio::detail::read_op<>::operator()()
(pid=raylet)     @     0x5623d44c4402  boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet)     @     0x5623d48252ef  boost::asio::detail::scheduler::do_run_one()
(pid=raylet)     @     0x5623d48267f1  boost::asio::detail::scheduler::run()
(pid=raylet)     @     0x5623d4827822  boost::asio::io_context::run()
(pid=raylet)     @     0x5623d415535e  main
(pid=raylet)     @     0x7fb227508b97  __libc_start_main
(pid=raylet)     @     0x5623d416bf31  (unknown)


(pid=3048) E0901 14:05:29.396008  3048  3168 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Actor, function_name=ready, function_hash=}, task_id=150a9d56b40e3700bdff035801000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=bdff035801000000, actor_caller_id=ffffffffffffffffdf5a1a8201000000, actor_counter=0}
(pid=3048) Traceback (most recent call last):
(pid=3048)   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=3048)     with core_worker.profile_event(b"task:execute"):
(pid=3048)   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
(pid=3048)     with ray.worker._changeproctitle(title, next_title):
(pid=3048)   File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
(pid=3048)     outputs = function_executor(*args, **kwargs)
(pid=3048)   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
(pid=3048)     return function(actor, *arguments, **kwarguments)
(pid=3048)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/function_manager.py", line 553, in actor_method_executor
(pid=3048)     return method(actor, *args, **kwargs)
(pid=3048)   File "test_actor.py", line 18, in ready
(pid=3048)     return ray.get(self.child.ready.remote())
(pid=3048)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1425, in get
(pid=3048)     raise value
(pid=3048) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=3048)
(pid=3048) During handling of the above exception, another exception occurred:
(pid=3048)
(pid=3048) Traceback (most recent call last):
(pid=3048)   File "python/ray/_raylet.pyx", line 549, in ray._raylet.task_execution_handler
(pid=3048)     execute_task(task_type, ray_function, c_resources, c_args,
(pid=3048)   File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task
(pid=3048)     with core_worker.profile_event(b"task", extra_data=extra_data):
(pid=3048)   File "python/ray/_raylet.pyx", line 516, in ray._raylet.execute_task
(pid=3048)     ray.utils.push_error_to_driver(
(pid=3048)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048)   File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048)     with nogil:
(pid=3048)   File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048)     raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
(pid=3048)
(pid=3048) During handling of the above exception, another exception occurred:
(pid=3048)
(pid=3048) Traceback (most recent call last):
(pid=3048)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048)   File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048)     with nogil:
(pid=3048)   File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048)     raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
(pid=3048) Exception ignored in: 'ray._raylet.task_execution_handler'
(pid=3048) Traceback (most recent call last):
(pid=3048)   File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048)   File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048)     with nogil:
(pid=3048)   File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048)     raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
Traceback (most recent call last):
  File "test_actor.py", line 27, in <module>
    print("ready", ray.get(p.ready.remote()))
  File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1423, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::Parent.ready() (pid=3048, ip=192.168.1.46)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
    with ray.worker._changeproctitle(title, next_title):
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
    return function(actor, *arguments, **kwarguments)
  File "test_actor.py", line 18, in ready
    return ray.get(self.child.ready.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0901 14:05:29.990612  2936  2936 raylet_client.cc:130] IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

2 reactions
stephanie-wang commented, Sep 2, 2020

I think it would be best to just let the child’s parent restart it. The problem is that this can also happen when a non-actor task creates an actor, and the GCS has no visibility into whether that task will get restarted or not.

To make the state transitions in the GCS simpler, I think we would need to key the actor by (actor ID, epoch) instead of just (actor ID). But yeah, I don’t think we can support this in time for 1.0. Probably requires more design.
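To make the (actor ID, epoch) idea concrete, here is a toy sketch of how keying the actor table by an incarnation epoch would keep a stale DEAD record from shadowing a new incarnation. This is an illustration only, not Ray's actual GCS code; the names actor_table, register_actor, and mark_dead are made up for the example:

# Toy illustration of keying actor state by (actor_id, epoch); not Ray's GCS implementation.
actor_table = {}  # (actor_id, epoch) -> state

def register_actor(actor_id, epoch):
    actor_table[(actor_id, epoch)] = "ALIVE"

def mark_dead(actor_id, epoch):
    actor_table[(actor_id, epoch)] = "DEAD"

# Parent P creates child C at epoch 0, then P dies and C is marked DEAD.
register_actor("C", 0)
mark_dead("C", 0)

# The restarted P re-registers C under epoch 1, so the DEAD entry for
# epoch 0 no longer applies to the new incarnation.
register_actor("C", 1)
assert actor_table[("C", 1)] == "ALIVE"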

2 reactions
stephanie-wang commented, Sep 2, 2020

I have some idea about what the bug is, but I’m not sure about a simple fix. I think the current sequence of events is:

  1. Parent P creates child actor with ID C. GCS creates C.
  2. P dies => GCS marks C as DEAD.
  3. P gets restarted and resubmits C’s creation task to the GCS. (I’m not sure what the GCS does with this message, maybe it drops it?)
  4. P submits the task to C again and subscribes to C’s location. It receives the GCS DEAD entry and fails all tasks.

I tried modifying the source code so that C’s ID is generated randomly instead of deterministically from P’s ID. We could also do something similar where we add an epoch number to each actor that gets incremented each time the actor’s owner restarts. The random ID generation fixed this particular script, but it won’t work for the case where C was started with max_task_retries != 0. In that case, anyone that already holds a handle to C will somehow need to wait and learn the new actor ID even though they think the actor is DEAD.

We may want to just focus on cases where C is not started with automatic task retries, since those cases are significantly easier to support. If automatic task retries are enabled, I think we’ll have to either figure out a way for the ref holder to decide when the actor’s owner will never restart again, or it could just fate-share with the actor’s owner. It seems like we need to first understand when these cases might come up.
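
In the spirit of the first suggestion above ("let the child’s parent restart it"), one application-level sketch is to have the parent catch RayActorError and recreate the child itself. This is only an illustration of that pattern, not a fix verified in this issue; whether it actually sidesteps the stale DEAD record described above depends on how the new child’s actor ID is generated:

import ray

@ray.remote
class Actor:
    def ready(self):
        return

@ray.remote(max_restarts=-1, max_task_retries=-1)
class Parent:
    def __init__(self):
        self.child = Actor.remote()

    def ready(self):
        try:
            return ray.get(self.child.ready.remote())
        except ray.exceptions.RayActorError:
            # The child died (for example because this parent crashed and was
            # restarted); recreate it and retry the call once.
            self.child = Actor.remote()
            return ray.get(self.child.ready.remote())

if __name__ == "__main__":
    ray.init()
    p = Parent.remote()
    print("ready", ray.get(p.ready.remote()))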

Read more comments on GitHub >

Top Results From Across the Web

Child actor not replicating correctly with AttachToComponent
I have a flashlight actor that I spawn on the SERVER via an actor component in a player character. The flashlight is then...
Read more >
Prevent akka actor from restarting child actor - Stack Overflow
An Actor is restarted because its internal state has become invalid and cannot be trusted anymore. Since the child actors it creates are...
Read more >
Macaulay Culkin Won't Recreate His 'Home Alone' Face
Former child actor Macaulay Culkin is recognized wherever he goes, and admitted to Ellen that he constantly has to deny people when they...
Read more >
Five Famous Child Actors Who Failed To Recreate The ...
Aditya Narayan · Aditya Narayan ; Imran Khan · Imran Khan ; Jugal Hansraj · Jugal Hansraj ; Sana Saeed · Sana Saeed...
Read more >
Lesson 10: Overview of the supervisor hierarchy in Proto.Actor.
One strategy is to restart the actor from his initial state. There may be situations where the parent actor does not know what...
Read more >
