
[rllib] DDPG ApeX fails with PyTorch and GPU


This is not a contribution.

What is the problem?

I tried running this example (https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddpg/mountaincarcontinuous-apex-ddpg.yaml) with PyTorch and a GPU (Titan Xp). However, it always fails with the following error:

Failure # 1 (occurred at 2020-10-28_10-12-48)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::APEX_DDPG.train() (pid=39535, ip=XX.XX.XX.XX)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 519, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
    res = next(self.train_exec_impl)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/execution/concurrency_ops.py", line 132, in base_iterator
    raise RuntimeError("Error raised reading from queue")
RuntimeError: Error raised reading from queue
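
Note that the RuntimeError at the bottom of the trace is raised on the consumer side of the Ape-X learner queue, so it masks the original exception, which occurs inside the learner thread. As a rough sketch of the pattern behind base_iterator in concurrency_ops.py (not RLlib's exact code, and assuming check is the learner thread's is_alive):

import queue

def dequeue(output_queue, check):
    # Consume results from the learner thread's output queue for as long
    # as the thread is still alive (check() returns True).
    while check():
        try:
            yield output_queue.get(timeout=0.001)
        except queue.Empty:
            continue
    # The learner thread died; its own traceback went to that thread's
    # log output, so all the driver can report here is a generic error.
    raise RuntimeError("Error raised reading from queue")

In practice this means the real failure usually appears earlier in the trial's stderr/worker logs, not in the trace above.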

Ray version and other system information (Python version, TensorFlow version, OS):
  • Ray version: 1.0.0
  • Python version: 3.6.8
  • TensorFlow version: 1.15.0 (not running with tf, though)
  • PyTorch version: 1.6.0+cu92
  • OS: Ubuntu 18.04 (in a Docker container)

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

To reproduce, run rllib train -f mountaincarcontinuous-apex-ddpg-pytorch.yaml. Here is the content of the file:

# This can be expected to reach 90 reward within ~1.5-2.5m timesteps / ~150-250 seconds on a K40 GPU
mountaincarcontinuous-apex-ddpg-pytorch:
    env: MountainCarContinuous-v0
    run: APEX_DDPG
    stop:
        episode_reward_mean: 90
    config:
        # Works for both torch and tf.
        framework: torch  # <— This line is changed
        clip_rewards: False
        num_workers: 16
        num_gpus: 1  # <— This line is changed
        exploration_config:
            ou_base_scale: 1.0
        n_step: 3
        target_network_update_freq: 50000
        tau: 1.0
        evaluation_interval: 5
        evaluation_num_episodes: 10
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
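
For reference, the same experiment can also be started through Ray Tune's Python API rather than the YAML file; the following is a minimal sketch assuming Ray 1.0.0, mirroring the config above:

import ray
from ray import tune

ray.init()

# Mirrors mountaincarcontinuous-apex-ddpg-pytorch.yaml above.
tune.run(
    "APEX_DDPG",
    stop={"episode_reward_mean": 90},
    config={
        "env": "MountainCarContinuous-v0",
        "framework": "torch",  # <- changed from the tuned example
        "clip_rewards": False,
        "num_workers": 16,
        "num_gpus": 1,  # <- changed from the tuned example
        "exploration_config": {"ou_base_scale": 1.0},
        "n_step": 3,
        "target_network_update_freq": 50000,
        "tau": 1.0,
        "evaluation_interval": 5,
        "evaluation_num_episodes": 10,
    },
)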

Thanks a lot for your help, Lukas

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
chwhiteley commented, Oct 28, 2020

I also face this issue when running the “custom_rnn_model” example (https://github.com/ray-project/ray/blob/master/rllib/examples/custom_rnn_model.py) with APPO instead of PPO, both with PyTorch and TensorFlow. This issue has also been mentioned in #9436; it would be really helpful if someone found a solution.

"""Example of using a custom RNN keras model."""

import argparse

import ray
from ray import tune
from ray.tune.registry import register_env
from ray.rllib.examples.env.repeat_after_me_env import RepeatAfterMeEnv
from ray.rllib.examples.env.repeat_initial_obs_env import RepeatInitialObsEnv
from ray.rllib.examples.models.rnn_model import RNNModel, TorchRNNModel
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="APPO")
parser.add_argument("--env", type=str, default="RepeatAfterMeEnv")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--torch", default=True, action="store_true")
parser.add_argument("--stop-reward", type=float, default=90)
parser.add_argument("--stop-iters", type=int, default=100)
parser.add_argument("--stop-timesteps", type=int, default=100000)

if __name__ == "__main__":
    args = parser.parse_args()

    ray.init(num_cpus=args.num_cpus or None)

    ModelCatalog.register_custom_model(
        "rnn", TorchRNNModel if args.torch else RNNModel)
    register_env("RepeatAfterMeEnv", lambda c: RepeatAfterMeEnv(c))
    register_env("RepeatInitialObsEnv", lambda _: RepeatInitialObsEnv())

    config = {
        "env": args.env,
        "env_config": {
            "repeat_delay": 2,
        },
        "gamma": 0.9,
        "num_workers": 0,
        "num_envs_per_worker": 20,
        "entropy_coeff": 0.001,
        "num_sgd_iter": 5,
        "vf_loss_coeff": 1e-5,
        "model": {
            "custom_model": "rnn",
            "max_seq_len": 20,
        },
        "framework": "torch" if args.torch else "tf",
    }

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }

    results = tune.run(args.run, config=config, stop=stop)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)
    ray.shutdown()
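
The script can be launched directly, with the stop conditions taken from the argparse defaults above, e.g.:

python custom_rnn_model.py --run APPO --stop-reward 90 --as-test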

Ray version and other system information (Python version, TensorFlow version, OS):
  • Ray version: 0.8.6
  • Python version: 3.7.9
  • TensorFlow version: 1.15.0
  • PyTorch version: 1.4.0+cpu
  • OS: Windows 10 Enterprise, version 1809

0 reactions
sven1977 commented, Dec 14, 2020

This does seem like a PyTorch/CUDA/cuDNN bug; see e.g. https://github.com/pytorch/pytorch/issues/21819
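
Since the diagnosis points at the CUDA stack, one quick sanity check is to print the CUDA/cuDNN versions PyTorch was built against and compare them with the machine's driver setup (standard torch APIs; a diagnostic sketch, not a fix):

import torch

# Print the CUDA/cuDNN stack this PyTorch build was compiled against.
print("torch:", torch.__version__)                # e.g. 1.6.0+cu92
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))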


