[rllib] DDPG ApeX fails with PyTorch and GPU
This is not a contribution.
What is the problem?
I tried running this example (https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddpg/mountaincarcontinuous-apex-ddpg.yaml) with PyTorch and a GPU (Titan Xp). However, it always fails with the following error:
```
Failure # 1 (occurred at 2020-10-28_10-12-48)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trial_runner.py", line 515, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 488, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::APEX_DDPG.train() (pid=39535, ip=XX.XX.XX.XX)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 519, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 505, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 134, in step
    res = next(self.train_exec_impl)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/execution/concurrency_ops.py", line 132, in base_iterator
    raise RuntimeError("Error raised reading from queue")
RuntimeError: Error raised reading from queue
```
Ray version and other system information (Python version, TensorFlow version, OS):
- Ray version: 1.0.0
- Python version: 3.6.8
- TensorFlow version: 1.15.0 (not running with tf though)
- PyTorch version: 1.6.0+cu92
- OS: Ubuntu 18.04 (in a Docker container)
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
To reproduce, run `rllib train -f mountaincarcontinuous-apex-ddpg-pytorch.yaml`. Here is the content of that file:
```yaml
# This can be expected to reach 90 reward within ~1.5-2.5m timesteps / ~150-250 seconds on a K40 GPU
mountaincarcontinuous-apex-ddpg-pytorch:
    env: MountainCarContinuous-v0
    run: APEX_DDPG
    stop:
        episode_reward_mean: 90
    config:
        # Works for both torch and tf.
        framework: torch  # <-- this line is changed
        clip_rewards: False
        num_workers: 16
        num_gpus: 1  # <-- this line is changed
        exploration_config:
            ou_base_scale: 1.0
        n_step: 3
        target_network_update_freq: 50000
        tau: 1.0
        evaluation_interval: 5
        evaluation_num_episodes: 10
```
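In case a plain Python script is easier to work with than the YAML, below is a minimal sketch of what I believe is the equivalent `tune.run()` call; the config values simply mirror the YAML above (I have only been running the `rllib train` command myself):

```python
# Minimal sketch of a Python equivalent of the YAML above (values mirror the
# YAML; only framework="torch" and num_gpus=1 differ from the tuned example).
import ray
from ray import tune

if __name__ == "__main__":
    ray.init()
    tune.run(
        "APEX_DDPG",
        stop={"episode_reward_mean": 90},
        config={
            "env": "MountainCarContinuous-v0",
            "framework": "torch",  # <-- changed from the tuned example
            "clip_rewards": False,
            "num_workers": 16,
            "num_gpus": 1,  # <-- changed from the tuned example
            "exploration_config": {"ou_base_scale": 1.0},
            "n_step": 3,
            "target_network_update_freq": 50000,
            "tau": 1.0,
            "evaluation_interval": 5,
            "evaluation_num_episodes": 10,
        },
    )
```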
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Thanks a lot for your help, Lukas
I also face this issue when running the "custom_rnn_model" example (https://github.com/ray-project/ray/blob/master/rllib/examples/custom_rnn_model.py) with APPO instead of PPO, both with PyTorch and TensorFlow. This issue has also been mentioned in #9436; it would be really helpful if someone found a solution.
Ray version and other system information (Python version, TensorFlow version, OS):
- Ray version: 0.8.6
- Python version: 3.7.9
- TensorFlow version: 1.15.0
- PyTorch version: 1.4.0+cpu
- OS: Windows 10 Enterprise, version 1809
This does seem like a PyTorch/CUDA/cuDNN bug, e.g. https://github.com/pytorch/pytorch/issues/21819
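If it helps, here is a minimal sketch (standard torch calls only, nothing RLlib-specific) for checking whether the local CUDA/cuDNN stack is healthy before digging further into RLlib:

```python
# Minimal sketch for sanity-checking the PyTorch/CUDA/cuDNN setup,
# independent of RLlib.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (torch build):", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())

if torch.cuda.is_available():
    # Run a small matmul on the GPU; if the driver/toolkit combination is
    # broken, this typically fails with a similar low-level CUDA error.
    x = torch.randn(64, 64, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print("GPU matmul OK:", y.shape)
```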