[rllib] Actor died unexpectedly before finishing this task.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): pip
- Ray version: 0.7.5
- Python version: 3.6.8
- Exact command to reproduce:
ray.get(future)
(see code below)
Describe the problem
Apologies if this issue is because I’m doing something wrong.
I’m trying to use Global Coordination to make a stateful policy mapping function. I’ve found that if I try to get the value of a future from a remote object in an RLlib callback, a policy mapping function, or a custom train function, Ray/the actor dies with ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Source code / logs
This minimal example should be all you need to reproduce.
from gym.envs.classic_control import CartPoleEnv

import ray
from ray import tune
from ray.experimental import named_actors


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n):
        self.count += n

    def get(self):
        return self.count


class CountingCartpole(CartPoleEnv):
    def step(self, action):
        counter = named_actors.get_actor("global_counter")
        counter.inc.remote(1)
        return super().step(action)


def on_episode_end(info):
    counter = named_actors.get_actor("global_counter")
    future = counter.get.remote()
    count = ray.get(future)
    print(count)


if __name__ == "__main__":
    ray.init()
    named_actors.register_actor("global_counter", Counter.remote())
    tune.register_env("counting_cartpole", lambda _: CountingCartpole())

    trials = tune.run(
        "PG",
        stop={"training_iteration": 10},
        config={
            "env": "counting_cartpole",
            "callbacks": {"on_episode_end": on_episode_end},
        },
    )
(pid=68066) 2019-10-02 16:40:10,472 INFO tf_run_builder.py:92 -- Executing TF run without tracing. To dump TF timeline traces to disk, set the TF_TIMELINE_DIR environment variable.
2019-10-02 16:40:10,517 ERROR trial_runner.py:560 -- Error processing event.
Traceback (most recent call last):
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 506, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 347, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise value
ray.exceptions.RayTaskError: ray_PG:train() (pid=68066, host=dave)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 417, in train
    raise e
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 406, in train
    result = Trainable.train(self)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/optimizers/sync_samples_optimizer.py", line 70, in step
    samples.append(self.workers.local_worker().sample())
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 469, in sample
    batches = [self.input_reader.next()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 56, in next
    batches = [self.get_data()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 99, in get_data
    item = next(self.rollout_provider)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 319, in _env_runner
    soft_horizon, no_done_at_end)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 473, in _process_observations
    "episode": episode
  File "/home/dave/projects/test_rllib/test.py", line 29, in on_episode_end
    count = ray.get(future)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Top GitHub Comments
Thanks, @davidcotton! The issue that you’re getting has to do with the way Ray maintains handles to named actors. Hopefully that will be fixed with #5814.
For your particular application, you can actually fix it by saving the actor handle directly in the environment and reusing it later on in the callback. I’d recommend using this kind of pattern wherever possible instead of named actors. Here’s some example code:
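(The original snippet is not preserved in this archive. Below is a minimal sketch of the pattern the comment describes, written against the Ray 0.7.5-era API. The details are assumptions for illustration, not the maintainer's code: the handle is passed to the environment through env_config, the callback reaches the env via info["env"].get_unwrapped(), and the actor handle is assumed to serialize cleanly through the Tune config.)

import ray
from ray import tune
from gym.envs.classic_control import CartPoleEnv


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n):
        self.count += n

    def get(self):
        return self.count


class CountingCartpole(CartPoleEnv):
    def __init__(self, env_config):
        super().__init__()
        # Keep the actor handle on the env instead of looking it up by name.
        self.counter = env_config["counter"]

    def step(self, action):
        self.counter.inc.remote(1)
        return super().step(action)


def on_episode_end(info):
    # Reuse the handle stored on the environment; no named-actor lookup needed.
    env = info["env"].get_unwrapped()[0]
    print(ray.get(env.counter.get.remote()))


if __name__ == "__main__":
    ray.init()
    counter = Counter.remote()
    tune.register_env("counting_cartpole", lambda cfg: CountingCartpole(cfg))
    tune.run(
        "PG",
        stop={"training_iteration": 10},
        config={
            "env": "counting_cartpole",
            # Hand the live handle to each env copy via env_config.
            "env_config": {"counter": counter},
            "callbacks": {"on_episode_end": on_episode_end},
        },
    )

The point of the pattern is that the handle travels with the env object itself, so the callback never has to resolve the actor by name on a worker process that may not hold a valid handle to it.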
@janblumenkamp that issue still persists for me. I have a 20-thread processor, but when num_workers > 2 I get this error. Any progress?