[rllib] Custom model cannot use GPU for driver when running PPO algorithm
What is the problem?
When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:
```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].
```
Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.0, Python 3.6.6, tensorflow-gpu 2.0.0, Fedora 28
Does the problem occur on the latest wheels?
Yes, although it gives a different error, and an additional parameter combination now fails: custom_keras_model.py with num_gpus set to 1, a combination that works on Ray 0.8.0. The error on the latest wheel is the following:
```
  File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__
    "GPUs were assigned to this worker by Ray, but "
RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.
```
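A quick way to narrow down this second error is to check whether TensorFlow itself can see the GPU, independently of Ray. This is a minimal sketch; the guarded import is only so the snippet also runs where TensorFlow is absent:

```python
# Check whether TensorFlow can see a GPU at all; if this list is empty,
# RLlib's "GPUs were assigned to this worker" error is expected.
try:
    import tensorflow as tf
    # tf.config.experimental.list_physical_devices is available in TF 2.0.x
    gpus = tf.config.experimental.list_physical_devices("GPU")
except ImportError:
    gpus = None  # TensorFlow is not installed in this environment

print("Visible GPUs:", gpus)
```

If the list is empty while `nvidia-smi` shows a GPU, the CUDA/cuDNN installation is the more likely culprit than RLlib.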
Summary
| Ray version | script | num_gpus | works? |
|---|---|---|---|
| 0.8.0 | custom_keras_model.py | 0 | Yes |
| 0.8.0 | custom_keras_model.py | 1 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 0 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 1 | No |
| latest wheel | custom_keras_model.py | 0 | Yes |
| latest wheel | custom_keras_model.py | 1 | No |
| latest wheel | custom_keras_rnn_model.py | 0 | Yes |
| latest wheel | custom_keras_rnn_model.py | 1 | No |
Reproduction
Please note that this only reproduces the last row in the table. In order to test custom_keras_model.py, you also need to modify the algorithm used at the top of the file.
```
python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0
# Install [rllib] dependencies
pip3 install ray[rllib]==0.8.0
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
```
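Rather than sed-patching the installed example file, the same setting can be expressed in the config dict that the example passes to `tune.run()`. A minimal sketch; the keys other than `"num_gpus"` are illustrative placeholders, not taken from the example file:

```python
# Sketch of the relevant config change: add "num_gpus" to the PPO config
# instead of editing the installed example with sed.
config = {
    "env": "RepeatAfterMeEnv",  # placeholder env name
    "num_workers": 0,           # placeholder worker count
    "num_gpus": 1,              # the setting that triggers the failure
}

# In the real script this would then be passed along as:
#   tune.run("PPO", config=config, ...)
print("num_gpus =", config["num_gpus"])
```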
Issue Analytics
- State:
- Created 4 years ago
- Comments: 16 (9 by maintainers)
Top GitHub Comments
Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!
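To compare a local toolkit against the working combination above, a guarded check like the following can help. The paths are common defaults and may differ per system; each branch falls back to a message rather than failing:

```shell
# Report the CUDA compiler version, if the toolkit is on PATH.
command -v nvcc >/dev/null && nvcc --version || echo "nvcc not found"

# Report the cuDNN version from the header, if installed in the default location.
CUDNN_H=/usr/local/cuda/include/cudnn.h
[ -f "$CUDNN_H" ] && grep "CUDNN_MAJOR\|CUDNN_MINOR\|CUDNN_PATCHLEVEL" "$CUDNN_H" \
  || echo "cudnn.h not found at $CUDNN_H"
```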
I have the same problem with:
- Python 3.6.6
- Ray 0.8.5
- tensorflow-gpu 2.1.0
- CUDA 10.0
- cuDNN 7.6.5
Interestingly, with the same config I was able to instantiate a `Trainer()` and do `trainer.train()`. Using `Ray.tune` threw the above error.