[rllib] Actor died unexpectedly before finishing this task.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): pip
- Ray version: 0.7.5
- Python version: 3.6.8
- Exact command to reproduce:
ray.get(future)
(see code below)
Describe the problem
Apologies if this issue is because I’m doing something wrong.
I’m trying to use Global Coordination to make a stateful policy mapping function. I’ve found that if I try to get the value of a future from a remote object in an RLlib callback, a policy mapping function, or a custom train function, Ray/the actor dies with ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Source code / logs
This minimal example should be all you need to reproduce.
from gym.envs.classic_control import CartPoleEnv

import ray
from ray import tune
from ray.experimental import named_actors


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n):
        self.count += n

    def get(self):
        return self.count


class CountingCartpole(CartPoleEnv):
    def step(self, action):
        counter = named_actors.get_actor("global_counter")
        counter.inc.remote(1)
        return super().step(action)


def on_episode_end(info):
    counter = named_actors.get_actor("global_counter")
    future = counter.get.remote()
    count = ray.get(future)
    print(count)


if __name__ == "__main__":
    ray.init()
    named_actors.register_actor("global_counter", Counter.remote())
    tune.register_env("counting_cartpole", lambda _: CountingCartpole())

    trials = tune.run(
        "PG",
        stop={"training_iteration": 10},
        config={
            "env": "counting_cartpole",
            "callbacks": {"on_episode_end": on_episode_end},
        },
    )
(pid=68066) 2019-10-02 16:40:10,472 INFO tf_run_builder.py:92 -- Executing TF run without tracing. To dump TF timeline traces to disk, set the TF_TIMELINE_DIR environment variable.
2019-10-02 16:40:10,517 ERROR trial_runner.py:560 -- Error processing event.
Traceback (most recent call last):
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 506, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 347, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise value
ray.exceptions.RayTaskError: ray_PG:train() (pid=68066, host=dave)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 417, in train
    raise e
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 406, in train
    result = Trainable.train(self)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/optimizers/sync_samples_optimizer.py", line 70, in step
    samples.append(self.workers.local_worker().sample())
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 469, in sample
    batches = [self.input_reader.next()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 56, in next
    batches = [self.get_data()]
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 99, in get_data
    item = next(self.rollout_provider)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 319, in _env_runner
    soft_horizon, no_done_at_end)
  File "/home/dave/projects/test_rllib/venv/lib/python3.6/site-packages/ray/rllib/evaluation/sampler.py", line 473, in _process_observations
    "episode": episode
  File "/home/dave/projects/test_rllib/test.py", line 29, in on_episode_end
    count = ray.get(future)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Top GitHub Comments
Thanks, @davidcotton! The issue that you’re getting has to do with the way Ray maintains handles to named actors. Hopefully that will be fixed with #5814.
For your particular application, you can actually fix it by saving the actor handle directly in the environment and reusing it later on in the callback. I’d recommend using this kind of pattern wherever possible instead of named actors. Here’s some example code:
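(The original snippet is not preserved in this archive. Below is a minimal sketch of the pattern the comment describes, written against the Ray 0.7.5-era API. The details are assumptions for illustration, not the maintainer's code: the handle is passed to the environment through env_config, the callback reaches the env via info["env"].get_unwrapped(), and the actor handle is assumed to serialize cleanly through the Tune config.)

import ray
from ray import tune
from gym.envs.classic_control import CartPoleEnv


@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def inc(self, n):
        self.count += n

    def get(self):
        return self.count


class CountingCartpole(CartPoleEnv):
    def __init__(self, env_config):
        super().__init__()
        # Keep the actor handle on the env instead of looking it up by name.
        self.counter = env_config["counter"]

    def step(self, action):
        self.counter.inc.remote(1)
        return super().step(action)


def on_episode_end(info):
    # Reuse the handle stored on the environment; no named-actor lookup needed.
    env = info["env"].get_unwrapped()[0]
    print(ray.get(env.counter.get.remote()))


if __name__ == "__main__":
    ray.init()
    counter = Counter.remote()
    tune.register_env("counting_cartpole", lambda cfg: CountingCartpole(cfg))
    tune.run(
        "PG",
        stop={"training_iteration": 10},
        config={
            "env": "counting_cartpole",
            # Hand the live handle to each env copy via env_config.
            "env_config": {"counter": counter},
            "callbacks": {"on_episode_end": on_episode_end},
        },
    )

The point of the pattern is that the handle travels with the env object itself, so the callback never has to resolve the actor by name on a worker process that may not hold a valid handle to it.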
@janblumenkamp that issue still persists for me. I have a 20-thread processor, but when num_workers > 2 I get this error. Any progress?