Cannot recreate a child actor
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.9dev
The problem occurs for tasks/actors that are automatically restarted by Ray after a crash. When the task or actor creates a child actor and then crashes, I expect the child actor to be recreated when the parent task/actor is restarted. Instead, any tasks submitted to the child actor fail with a RayActorError.
Reproduction (REQUIRED)
I tried two different versions of the script. One uses the default @ray.remote decorator for the child actor; the other uses @ray.remote(max_restarts=-1, max_task_retries=-1) for the child (that variant is shown after the first script below).
import ray
import time
import os


@ray.remote
class Actor:
    def __init__(self):
        pass

    def ready(self):
        return


@ray.remote(max_restarts=-1, max_task_retries=-1)
class Parent:
    def __init__(self):
        self.child = Actor.remote()

    def ready(self):
        return ray.get(self.child.ready.remote())

    def pid(self):
        return os.getpid()


ray.init()
p = Parent.remote()
pid = ray.get(p.pid.remote())
os.kill(pid, 9)
print("ready", ray.get(p.ready.remote()))
Output:
2020-09-01 14:04:57,259 INFO resource_spec.py:250 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.18 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 14:04:57,732 INFO services.py:1201 -- View the Ray dashboard at 127.0.0.1:8265
E0901 14:04:59.840095 2653 2892 task_manager.cc:304] infinite retries left for task 7bbd90284b71e599df5a1a8201000000, attempting to resubmit.
E0901 14:04:59.840198 2653 2892 core_worker.cc:410] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Parent, function_name=ready, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2020-09-01 14:04:59,850 WARNING worker.py:1067 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
Traceback (most recent call last):
  File "test_actor.py", line 26, in <module>
    print("ready", ray.get(p.ready.remote()))
  File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1423, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::Parent.ready() (pid=2743, ip=192.168.1.46)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
    with ray.worker._changeproctitle(title, next_title):
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
    return function(actor, *arguments, **kwarguments)
  File "test_actor.py", line 17, in ready
    return ray.get(self.child.ready.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=2743) E0901 14:05:05.768839 2743 2827 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Actor, function_name=ready, function_hash=}, task_id=150a9d56b40e3700bdff035801000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=bdff035801000000, actor_caller_id=ffffffffffffffffdf5a1a8201000000, actor_counter=0}
Output with @ray.remote(max_restarts=-1, max_task_retries=-1) for the child actor:
2020-09-01 14:05:20,903 INFO resource_spec.py:250 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.17 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-09-01 14:05:21,350 INFO services.py:1201 -- View the Ray dashboard at 127.0.0.1:8265
E0901 14:05:23.458299 2936 3212 task_manager.cc:304] infinite retries left for task 7bbd90284b71e599df5a1a8201000000, attempting to resubmit.
E0901 14:05:23.458393 2936 3212 core_worker.cc:410] Will resubmit task after a 5000ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Parent, function_name=ready, function_hash=}, task_id=7bbd90284b71e599df5a1a8201000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=df5a1a8201000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=1}
2020-09-01 14:05:23,466 WARNING worker.py:1067 -- A worker died or was killed while executing task ffffffffffffffffdf5a1a8201000000.
(pid=3047) 2020-09-01 14:05:23,535 ERROR worker.py:372 -- SystemExit was raised from the worker
(pid=3047) Traceback (most recent call last):
(pid=3047) File "python/ray/_raylet.pyx", line 549, in ray._raylet.task_execution_handler
(pid=3047) execute_task(task_type, ray_function, c_resources, c_args,
(pid=3047) File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task
(pid=3047) with core_worker.profile_event(b"task", extra_data=extra_data):
(pid=3047) File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=3047) with core_worker.profile_event(b"task:execute"):
(pid=3047) File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
(pid=3047) switch_worker_log_if_needed(worker, job_id)
(pid=3047) File "python/ray/_raylet.pyx", line 339, in ray._raylet.switch_worker_log_if_needed
(pid=3047) ray.worker.set_log_file(job_stdout_path, job_stderr_path)
(pid=3047) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 903, in set_log_file
(pid=3047) _set_log_file(stderr_name, worker_pid, sys.stderr, stderr_setter)
(pid=3047) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 869, in _set_log_file
(pid=3047) setter_func(open_log(fileno, unbuffered=True, closefd=False))
(pid=3047) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 433, in open_log
(pid=3047) stream = open(path, **kwargs)
(pid=3047) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/codecs.py", line 186, in __init__
(pid=3047) def __init__(self, errors='strict'):
(pid=3047) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 369, in sigterm_handler
(pid=3047) sys.exit(1)
(pid=3047) SystemExit: 1
(pid=raylet) F0901 14:05:23.537029 2971 2971 node_manager.cc:2557] Check failed: local_queues_.RemoveTask(task_id, &task) ffffffffffffffffbdff035801000000
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet) @ 0x5623d4527b3d google::LogMessage::Fail()
(pid=raylet) @ 0x5623d4528c9c google::LogMessage::SendToLog()
(pid=raylet) @ 0x5623d4527819 google::LogMessage::Flush()
(pid=raylet) @ 0x5623d4527a31 google::LogMessage::~LogMessage()
(pid=raylet) @ 0x5623d44e0139 ray::RayLog::~RayLog()
(pid=raylet) @ 0x5623d41f2ccf ray::raylet::NodeManager::FinishAssignedTask()
(pid=raylet) @ 0x5623d4206544 ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet) @ 0x5623d420662f ray::raylet::NodeManager::HandleWorkerAvailable()
(pid=raylet) @ 0x5623d420b37b ray::raylet::NodeManager::ProcessClientMessage()
(pid=raylet) @ 0x5623d4181ba0 _ZNSt17_Function_handlerIFvSt10shared_ptrIN3ray16ClientConnectionEElRKSt6vectorIhSaIhEEEZNS1_6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEEUlS3_lS8_E0_E9_M_invokeERKSt9_Any_dataS3_lS8_
(pid=raylet) @ 0x5623d44c64ce ray::ClientConnection::ProcessMessage()
(pid=raylet) @ 0x5623d44c33ba boost::asio::detail::read_op<>::operator()()
(pid=raylet) @ 0x5623d44c4402 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(pid=raylet) @ 0x5623d48252ef boost::asio::detail::scheduler::do_run_one()
(pid=raylet) @ 0x5623d48267f1 boost::asio::detail::scheduler::run()
(pid=raylet) @ 0x5623d4827822 boost::asio::io_context::run()
(pid=raylet) @ 0x5623d415535e main
(pid=raylet) @ 0x7fb227508b97 __libc_start_main
(pid=raylet) @ 0x5623d416bf31 (unknown)
(pid=3048) E0901 14:05:29.396008 3048 3168 task_manager.cc:323] Task failed: IOError: cancelling all pending tasks of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Actor, function_name=ready, function_hash=}, task_id=150a9d56b40e3700bdff035801000000, job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=bdff035801000000, actor_caller_id=ffffffffffffffffdf5a1a8201000000, actor_counter=0}
(pid=3048) Traceback (most recent call last):
(pid=3048) File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=3048) with core_worker.profile_event(b"task:execute"):
(pid=3048) File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
(pid=3048) with ray.worker._changeproctitle(title, next_title):
(pid=3048) File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
(pid=3048) outputs = function_executor(*args, **kwargs)
(pid=3048) File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
(pid=3048) return function(actor, *arguments, **kwarguments)
(pid=3048) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/function_manager.py", line 553, in actor_method_executor
(pid=3048) return method(actor, *args, **kwargs)
(pid=3048) File "test_actor.py", line 18, in ready
(pid=3048) return ray.get(self.child.ready.remote())
(pid=3048) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1425, in get
(pid=3048) raise value
(pid=3048) ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(pid=3048)
(pid=3048) During handling of the above exception, another exception occurred:
(pid=3048)
(pid=3048) Traceback (most recent call last):
(pid=3048) File "python/ray/_raylet.pyx", line 549, in ray._raylet.task_execution_handler
(pid=3048) execute_task(task_type, ray_function, c_resources, c_args,
(pid=3048) File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task
(pid=3048) with core_worker.profile_event(b"task", extra_data=extra_data):
(pid=3048) File "python/ray/_raylet.pyx", line 516, in ray._raylet.execute_task
(pid=3048) ray.utils.push_error_to_driver(
(pid=3048) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048) worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048) File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048) with nogil:
(pid=3048) File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048) raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
(pid=3048)
(pid=3048) During handling of the above exception, another exception occurred:
(pid=3048)
(pid=3048) Traceback (most recent call last):
(pid=3048) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048) worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048) File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048) with nogil:
(pid=3048) File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048) raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
(pid=3048) Exception ignored in: 'ray._raylet.task_execution_handler'
(pid=3048) Traceback (most recent call last):
(pid=3048) File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/utils.py", line 95, in push_error_to_driver
(pid=3048) worker.core_worker.push_error(job_id, error_type, message, time.time())
(pid=3048) File "python/ray/_raylet.pyx", line 1441, in ray._raylet.CoreWorker.push_error
(pid=3048) with nogil:
(pid=3048) File "python/ray/_raylet.pyx", line 150, in ray._raylet.check_status
(pid=3048) raise RayletError(message)
(pid=3048) ray.exceptions.RaySystemError: System error: Broken pipe
Traceback (most recent call last):
  File "test_actor.py", line 27, in <module>
    print("ready", ray.get(p.ready.remote()))
  File "/home/swang/anaconda3/envs/ray-wheel/lib/python3.7/site-packages/ray/worker.py", line 1423, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::Parent.ready() (pid=3048, ip=192.168.1.46)
  File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
    with ray.worker._changeproctitle(title, next_title):
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
    outputs = function_executor(*args, **kwargs)
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task.function_executor
    return function(actor, *arguments, **kwarguments)
  File "test_actor.py", line 18, in ready
    return ray.get(self.child.ready.remote())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
E0901 14:05:29.990612 2936 2936 raylet_client.cc:130] IOError: Broken pipe [RayletClient] Failed to disconnect from raylet.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
I think it would be best to just let the child’s parent restart it. The problem is that this can also happen when a non-actor task creates an actor, and the GCS has no visibility into whether that task will get restarted or not.
To make the state transitions in the GCS simpler, I think we would need to key the actor by (actor ID, epoch) instead of just (actor ID). But yeah, I don’t think we can support this in time for 1.0. Probably requires more design.
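A minimal toy sketch of what keying actor records by (actor ID, epoch) could look like; this is only an illustration with hypothetical names, not Ray's actual GCS code:

# Hypothetical illustration of keying actor records by (actor_id, epoch)
# instead of actor_id alone. It only shows why the DEAD record from epoch 0
# would no longer collide with the child re-created by the restarted owner
# at epoch 1.
class ToyActorTable:
    def __init__(self):
        self._records = {}  # (actor_id, epoch) -> state string

    def register(self, actor_id, epoch):
        self._records[(actor_id, epoch)] = "ALIVE"

    def mark_dead(self, actor_id, epoch):
        self._records[(actor_id, epoch)] = "DEAD"

    def state(self, actor_id, epoch):
        return self._records.get((actor_id, epoch), "UNKNOWN")


table = ToyActorTable()
table.register("child", epoch=0)   # original child created by the parent
table.mark_dead("child", epoch=0)  # parent crashes; child fate-shares and dies
table.register("child", epoch=1)   # restarted parent re-creates the child
assert table.state("child", epoch=0) == "DEAD"
assert table.state("child", epoch=1) == "ALIVE"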
I have some idea about what the bug is, but I’m not sure about a simple fix. I think the current sequence of events is:
I tried modifying the source code so that C's ID is generated randomly instead of deterministically based on P's ID. We could also do something similar where we add an epoch number to each actor that gets incremented each time the actor's owner restarts. The random ID generation fixed this particular script, but it won't work for the case where C was started with max_task_retries != 0. In that case, anyone that already has a handle to C will somehow need to wait and learn the new actor ID even though they think the actor is DEAD.
We may want to just focus on cases where C is not started with automatic task retries, since those cases are significantly easier to support. If automatic task retries are enabled, I think we'll have to either figure out a way for the ref holder to decide when the actor's owner will never restart again, or it could just fate-share with the actor's owner. It seems like we need to first understand when these cases might come up.
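For reference, the "let the child's parent restart it" idea could be approximated in user code for the non-retrying case along these lines. This is only a sketch under the assumption that a freshly created replacement child gets a usable new actor ID, which the issue does not confirm:

import ray
import ray.exceptions


@ray.remote
class Actor:
    def ready(self):
        return


@ray.remote(max_restarts=-1, max_task_retries=-1)
class Parent:
    def __init__(self):
        self.child = Actor.remote()

    def ready(self):
        try:
            return ray.get(self.child.ready.remote())
        except ray.exceptions.RayActorError:
            # The child fate-shared with the previous parent process, so
            # re-create it by hand and retry the call once.
            self.child = Actor.remote()
            return ray.get(self.child.ready.remote())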