
[rllib] Actor died unexpectedly before finishing this task.

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Ray installed from (source or binary): pip
  • Ray version: 0.7.5
  • Python version: 3.6.8
  • Exact command to reproduce: ray.get(future) (see code below)

Describe the problem

Apologies if this issue is because I’m doing something wrong.

I’m trying to use Global Coordination to make a stateful policy mapping function. I’ve found that if I try to get the value of a future from a remote actor in an RLlib callback, a policy mapping function, or a custom train function, the ray.get call fails with ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
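For context, the stateful policy mapping I’m ultimately after would look roughly like the sketch below (just illustrative; the "policy_coordinator" actor and its assign_policy method are made up for this sketch and are not part of the repro further down). The blocking ray.get inside it fails the same way as the callback in the minimal example:

import ray
from ray.experimental import named_actors

def policy_mapping_fn(agent_id):
    # Look up a named coordinator actor (registered elsewhere with
    # named_actors.register_actor, like the counter in the repro) and ask
    # it which policy this agent should use. The blocking ray.get here
    # raises the same RayActorError as the on_episode_end callback below.
    coordinator = named_actors.get_actor("policy_coordinator")
    return ray.get(coordinator.assign_policy.remote(agent_id))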

Source code / logs

This minimal example should be all you need to reproduce.

from gym.envs.classic_control import CartPoleEnv
import ray
from ray import tune
from ray.experimental import named_actors

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0
    def inc(self, n):
        self.count += n
    def get(self):
        return self.count

class CountingCartpole(CartPoleEnv):
    def step(self, action):
        counter = named_actors.get_actor("global_counter")
        counter.inc.remote(1)
        return super().step(action)

def on_episode_end(info):
    counter = named_actors.get_actor("global_counter")
    future = counter.get.remote()
    count = ray.get(future)
    print(count)

if __name__ == "__main__":
    ray.init()
    named_actors.register_actor("global_counter", Counter.remote())
    tune.register_env("counting_cartpole", lambda _: CountingCartpole())
    trials = tune.run(
        "PG",
        stop={"training_iteration": 10},
        config={
            "env": "counting_cartpole",
            "callbacks": {"on_episode_end": on_episode_end},
        }
    )
(pid=68066) 2019-10-02 16:40:10,472	INFO tf_run_builder.py:92 -- Executing TF run without tracing. To dump TF timeline traces to disk, set the TF_TIMELINE_DIR environment variable.
2019-10-02 16:40:10,517	ERROR trial_runner.py:560 -- Error processing event.
Traceback (most recent call last):
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 506, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 347, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise value
ray.exceptions.RayTaskError: ray_PG:train() (pid=68066, host=dave)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 417, in train
    raise e
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 406, in train
    result = Trainable.train(self)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/optimizers/sync_samples_optimizer.py", line 70, in step
    samples.append(self.workers.local_worker().sample())
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 469, in sample
    batches = [self.input_reader.next()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 56, in next
    batches = [self.get_data()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 99, in get_data
    item = next(self.rollout_provider)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 319, in _env_runner
    soft_horizon, no_done_at_end)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 473, in _process_observations
    "episode": episode
  File "/home/dave/projects/test_rllib/test.py", line 29, in on_episode_end
    count = ray.get(future)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Oct 8, 2019

Thanks, @davidcotton! The issue that you’re getting has to do with the way Ray maintains handles to named actors. Hopefully that will be fixed with #5814.

For your particular application, you can actually fix it by saving the actor handle directly in the environment and reusing it later on in the callback. I’d recommend using this kind of pattern wherever possible instead of named actors. Here’s some example code:

from gym.envs.classic_control import CartPoleEnv
import ray
from ray import tune


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0
    def inc(self, n):
        self.count += n
    def get(self):
        return self.count

class CountingCartpole(CartPoleEnv):
    def __init__(self, counter):
        super().__init__()
        self.counter = counter

    def step(self, action):
        self.counter.inc.remote(1)
        return super().step(action)

def on_episode_end(info):
    # Get the counter that was saved in the env instead of the named actor.
    env = info["env"].get_unwrapped()[0]
    future = env.counter.get.remote()
    count = ray.get(future)
    print(count)

if __name__ == "__main__":
    ray.init()
    # Need to keep around c to prevent the actor from exiting.
    c = Counter.remote()
    tune.register_env("counting_cartpole", lambda _: CountingCartpole(c))
    trials = tune.run(
        "PG",
        stop={"training_iteration": 100},
        config={
            "env": "counting_cartpole",
            "callbacks": {"on_episode_end": on_episode_end},
        }
    )
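As an optional follow-up (a small extra sketch, not required for the fix): since the driver keeps the handle c alive, it can still query the actor after tune.run returns, for example:

    # After tune.run returns, the driver-held handle can still query the actor.
    total = ray.get(c.get.remote())
    print("Total environment steps counted:", total)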
0 reactions
aidanmclaughlin commented, Nov 1, 2022

@janblumenkamp that issue still persists for me. I have a 20-thread processor, but when num_workers > 2, I get this error. Any progress?


