
[rllib] Custom model cannot use GPU for driver when running PPO algorithm

See original GitHub issue

What is the problem?

When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].

Ray version and other system information (Python version, TensorFlow version, OS):

Ray 0.8.0, Python 3.6.6, tensorflow-gpu 2.0.0, Fedora 28

Does the problem occur on the latest wheels?

Yes, although it gives a different error, and an additional parameter combination fails as well: custom_keras_model.py with num_gpus set to 1, which does not fail on Ray 0.8.0. The error on the latest wheel is the following:

File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__ "GPUs were assigned to this worker by Ray, but " RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.

Summary

Ray version  | script                    | num_gpus | works?
0.8.0        | custom_keras_model.py     | 0        | Yes
0.8.0        | custom_keras_model.py     | 1        | Yes
0.8.0        | custom_keras_rnn_model.py | 0        | Yes
0.8.0        | custom_keras_rnn_model.py | 1        | No
latest wheel | custom_keras_model.py     | 0        | Yes
latest wheel | custom_keras_model.py     | 1        | No
latest wheel | custom_keras_rnn_model.py | 0        | Yes
latest wheel | custom_keras_rnn_model.py | 1        | No

Reproduction

Note that this only reproduces the last row of the table. To test custom_keras_model.py, you also need to modify the algorithm used at the top of that file.

python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0
# Install [rllib] dependencies
pip3 install 'ray[rllib]==0.8.0'
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
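
The sed line simply injects "num_gpus": 1 into the config dict passed to tune.run() in the example. A rough sketch of the resulting call is shown below; apart from num_gpus, the names are assumed from the Ray 0.8.0 example and may differ in other versions:

```python
# Approximate effect of the sed edit on custom_keras_rnn_model.py. In the
# example, ray.init(), ModelCatalog.register_custom_model(...) and
# register_env(...) run before this call; they are omitted here.
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "RepeatAfterMeEnv",           # env name used by the example (assumed)
        "num_gpus": 1,                       # the injected line: give the driver a GPU
        "model": {"custom_model": "rnn"},    # the example's custom Keras RNN model (assumed name)
    },
)
```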

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

2 reactions
internetcoffeephone commented, Jan 2, 2020

Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!

1 reaction
annaluo676 commented, May 19, 2020

I have the same problem with: Python 3.6.6, Ray 0.8.5, tensorflow-gpu 2.1.0, CUDA 10.0, cuDNN 7.6.5.

Interestingly, with the same config I was able to instantiate a Trainer and call trainer.train(). Running it through ray.tune threw the above error.
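
For context, here is a hedged sketch of the two launch paths being contrasted, written against the Ray 0.8.x API; the env and stop criterion are placeholders, not taken from the comment:

```python
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer  # Ray 0.8.x-era import path

ray.init()
config = {"env": "CartPole-v0", "num_gpus": 1}

# Path 1: build the trainer directly and step it -- reported to work.
trainer = PPOTrainer(config=config)
trainer.train()

# Path 2: run the same config through ray.tune -- reported to raise the error above.
tune.run("PPO", config=config, stop={"training_iteration": 1})
```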

