
[rllib] Custom model cannot use GPU for driver when running PPO algorithm

See original GitHub issue

What is the problem?

When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].

Ray version and other system information (Python version, TensorFlow version, OS):

Ray 0.8.0, Python 3.6.6, tensorflow-gpu 2.0.0, Fedora 28

Does the problem occur on the latest wheels?

Yes, although it gives a different error, and an additional parameter combination fails as well: custom_keras_model.py with num_gpus set to 1, which does not fail on Ray 0.8.0. The error on the latest wheel is the following:

File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__ "GPUs were assigned to this worker by Ray, but " RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.

Summary

Ray version  | script                    | num_gpus | works?
0.8.0        | custom_keras_model.py     | 0        | Yes
0.8.0        | custom_keras_model.py     | 1        | Yes
0.8.0        | custom_keras_rnn_model.py | 0        | Yes
0.8.0        | custom_keras_rnn_model.py | 1        | No
latest wheel | custom_keras_model.py     | 0        | Yes
latest wheel | custom_keras_model.py     | 1        | No
latest wheel | custom_keras_rnn_model.py | 0        | Yes
latest wheel | custom_keras_rnn_model.py | 1        | No

Reproduction

Note that this only reproduces the last row of the table. To test custom_keras_model.py, you also need to modify the algorithm used at the top of that file.

python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0
# Install [rllib] dependencies
pip3 install 'ray[rllib]==0.8.0'
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
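
The sed line simply injects "num_gpus": 1 into the config dict passed to tune.run() in the example. A rough sketch of the resulting call is shown below; apart from num_gpus, the names are assumed from the Ray 0.8.0 example and may differ in other versions:

```python
# Approximate effect of the sed edit on custom_keras_rnn_model.py. In the
# example, ray.init(), ModelCatalog.register_custom_model(...) and
# register_env(...) run before this call; they are omitted here.
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "RepeatAfterMeEnv",           # env name used by the example (assumed)
        "num_gpus": 1,                       # the injected line: give the driver a GPU
        "model": {"custom_model": "rnn"},    # the example's custom Keras RNN model (assumed name)
    },
)
```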

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

2 reactions
internetcoffeephone commented, Jan 2, 2020

Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!

1 reaction
annaluo676 commented, May 19, 2020

I have the same problem with: Python 3.6.6, Ray 0.8.5, tensorflow-gpu 2.1.0, CUDA 10.0, cuDNN 7.6.5.

Interestingly, with the same config I was able to instantiate a Trainer and call trainer.train(). Running it through ray.tune threw the above error.
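
For context, here is a hedged sketch of the two launch paths being contrasted, written against the Ray 0.8.x API; the env and stop criterion are placeholders, not taken from the comment:

```python
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer  # Ray 0.8.x-era import path

ray.init()
config = {"env": "CartPole-v0", "num_gpus": 1}

# Path 1: build the trainer directly and step it -- reported to work.
trainer = PPOTrainer(config=config)
trainer.train()

# Path 2: run the same config through ray.tune -- reported to raise the error above.
tune.run("PPO", config=config, stop={"training_iteration": 1})
```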

