
Ray is not finding GPUs, but TF, PyTorch and nvcc do


I have two NVIDIA Titan X GPUs, but Ray isn’t seeing any:

import ray

ray.init(num_gpus=2)
print(ray.get_gpu_ids())
# prints []

Ray also prints the following line, indicating that it found no GPUs:

2019-10-16 18:20:17,954 INFO multi_gpu_optimizer.py:93 -- LocalMultiGPUOptimizer devices ['/cpu:0']

But TensorFlow sees all devices:

import tensorflow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

That prints:

[name: "/device:CPU:0"
device_type: "CPU"
...
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
...
, name: "/device:GPU:0"
device_type: "GPU"
...
, name: "/device:GPU:1"
device_type: "GPU"
...
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
...
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
...
]

Similarly,

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Why doesn’t Ray see my GPUs?

Issue analytics: created 4 years ago; 8 reactions; 14 comments (7 by maintainers).

Top GitHub Comments

2 reactions

SamShowalter commented, Jul 24, 2021

I am having the same issue as @Wormh0-le. It prevents me from training a torch policy without ray.tune, which I do not wish to use. I just want to call .train() on my agent.

1 reaction

Wormh0-le commented, Jun 29, 2021

Thanks, that was helpful, although it’s confusing. This is what happens:

Even if I explicitly init ray with num_gpus=1, ray.get_gpu_ids() is [].

However, if I start PPOTrainer with num_gpus=1 set explicitly in its config, then Ray gets a GPU. If I don’t set this in the config, it doesn’t.
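The trainer-side setting described above can be sketched as a plain config dict. This assumes the Ray 1.x RLlib API that was current when this thread was written (`ray.rllib.agents.ppo.PPOTrainer`); newer Ray releases use a different configuration API. The environment name is a placeholder, not something from the original thread. The point is only that `num_gpus` must appear in the trainer's own config, independently of `ray.init(num_gpus=...)`:

```python
# RLlib trainer config, Ray 1.x style (a sketch, not taken from the thread).
# ray.init(num_gpus=...) only declares how many GPUs the node has; it does not
# hand any of them to the trainer. "num_gpus" below is what requests one.
config = {
    "env": "CartPole-v0",  # placeholder environment (assumption)
    "framework": "torch",  # torch policy, as in SamShowalter's comment
    "num_gpus": 1,         # GPUs requested by the trainer (learner) process
}

# With Ray 1.x installed, the dict would then be used roughly like this:
#   from ray.rllib.agents.ppo import PPOTrainer
#   trainer = PPOTrainer(config=config)
#   trainer.train()
```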

I believe the confusing part is ray.get_gpu_ids(), which I thought returned the GPUs detected in the system. Instead, it returns the GPUs actually allocated. I think there should be a method, maybe detected_gpus(), so one can test that Ray indeed sees the GPUs and things are good to go. It would also be great if Ray just allocated GPUs to itself automatically (which would probably be fine 99% of the time) so we don’t have to worry about this additional config.
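The detected-versus-allocated distinction described above can be illustrated with a toy model. This is a hypothetical sketch for illustration, not Ray's implementation: `ToyNode`, `allocate` and `gpu_ids` are invented names, and `gpu_ids` only mimics the behavior the comment attributes to `ray.get_gpu_ids()`, namely returning what a process holds rather than what the machine has:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical model of one node's GPU bookkeeping (not Ray's real code)."""

    detected_gpus: List[int]  # GPUs the node registered, e.g. via ray.init(num_gpus=2)
    allocations: Dict[str, List[int]] = field(default_factory=dict)  # process -> GPU ids

    def free_gpus(self) -> List[int]:
        # GPUs detected on the node that no process currently holds
        used = {g for ids in self.allocations.values() for g in ids}
        return [g for g in self.detected_gpus if g not in used]

    def allocate(self, process: str, num_gpus: int) -> List[int]:
        # A process (e.g. a trainer with num_gpus=1 in its config) requests GPUs
        free = self.free_gpus()
        if num_gpus > len(free):
            raise RuntimeError(
                f"{process} requested {num_gpus} GPU(s) but only {len(free)} are free"
            )
        self.allocations[process] = free[:num_gpus]
        return list(self.allocations[process])

    def gpu_ids(self, process: str) -> List[int]:
        # Analogue of ray.get_gpu_ids(): only the GPUs *allocated* to this process,
        # so a process that never requested any gets [] even on a GPU machine.
        return list(self.allocations.get(process, []))


node = ToyNode(detected_gpus=[0, 1])
print(len(node.detected_gpus))  # 2: the node "sees" both GPUs
print(node.gpu_ids("driver"))   # []: the driver requested none
node.allocate("trainer", 1)
print(node.gpu_ids("trainer"))  # [0]
print(node.free_gpus())         # [1]
```

In real Ray, the closest existing check for detected GPUs is `ray.cluster_resources().get("GPU", 0)`, which reports the total GPU count Ray registered rather than the caller's allocation.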

And I explicitly set num_gpus=1, but Ray still can’t get a GPU, even though torch.cuda.is_available() is True. Why?