
GPU is not utilised while running the Katib NAS example

See original GitHub issue

Hi, I am running the Katib NAS example and I have noticed that, although it runs, it is not able to utilise the GPU of the machine. So I went inside the pod and tried to call TensorFlow from the pod, and I got the following error:

2022-01-25 11:19:26.568346: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-01-25 11:19:26.569961: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
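
A quick way to see what TensorFlow can actually use from inside the trial pod, and which CUDA/cuDNN versions the installed wheel expects, is a short check like the sketch below. This assumes a TensorFlow 2.3+ image and is only a diagnostic, not part of the Katib example itself.

# Diagnostic sketch, assuming TensorFlow 2.3+ inside the trial pod; not part of the Katib example.
import os
import tensorflow as tf

# TensorFlow resolves CUDA shared objects (e.g. libcusolver.so.11) via these paths.
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))

# GPUs TensorFlow can actually see and use.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

# CUDA/cuDNN versions this TensorFlow build was compiled against; a mismatch with the
# libraries present in the image is a common cause of the dlopen warning above.
print("Build info:", tf.sysconfig.get_build_info())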

Please help with this

Thanks

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
d-gol commented, Feb 8, 2022

@ashissharma97, so you are using GCP. In our on-premise cluster, we have drivers installed on the nodes, so we mount them.

On the page you linked (https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus) there is a note: "NVIDIA GPU drivers: You must install NVIDIA GPU drivers by yourself on your Container-Optimized OS VM instances. This section explains how to install the drivers on Container-Optimized OS VM instances."

This means the drivers should be installed on the VMs (the Kubernetes nodes). It may be possible to install the drivers in a container, but I have never tested that myself, so I can't help much there.
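
If it is unclear whether those libraries are visible inside the trial container at all, one way to check is to try loading the exact library the TensorFlow warning names. A minimal sketch (plain Python, nothing Katib-specific; libcusolver.so.11 is taken from the error message above):

# Sketch: check whether the library from the warning can be found and dlopen'd inside the pod.
import ctypes
import glob
import os

# Directories TensorFlow is told to search, per the LD_LIBRARY_PATH shown in the error message.
for d in filter(None, os.environ.get("LD_LIBRARY_PATH", "").split(":")):
    print(d, "->", glob.glob(os.path.join(d, "libcusolver.so*")) or "no libcusolver here")

# Try to dlopen the library directly; success means it is present and loadable, and the
# problem is more likely a CUDA/TensorFlow version mismatch than a missing mount.
try:
    ctypes.CDLL("libcusolver.so.11")
    print("libcusolver.so.11 loaded successfully")
except OSError as err:
    print("failed to load libcusolver.so.11:", err)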

Maybe the issue is with CUDA/TensorFlow compatibility, or with the CUDA installation path. These threads might help if that's the issue:

0 reactions
ashissharma97 commented, Feb 8, 2022

No problem @d-gol. By the way, thanks for your help; I'll surely check the links you have shared.

Read more comments on GitHub >

Top Results From Across the Web

Running an Experiment - Kubeflow
This guide describes how to configure and run a Katib experiment. The experiment can perform hyperparameter tuning or a neural architecture ...
Read more >
Better ML models with Katib - Towards Data Science
Experiment with HP tuning on their local machine, using a sample of the ... NAS: Katib is one of only two frameworks that...
Read more >
End-to-End Hyperparameter Tuning with Katib, Tensorflow ...
When it comes to scalable, GPU accelerated, Machine Learning applications, not everybody has the luxury of bursting into a massive ...
Read more >
Accelerating ETL on KubeFlow with RAPIDS
Using RAPIDS on your KubeFlow cluster empowers you to GPU-accelerate your ETL work in both your interactive sessions and ETL pipelines.
Read more >
Deployment of ML Models using Kubeflow on Different Cloud ...
While running ML models on developmental ... ways, Airflow was not built with Kubernetes in mind and is more useful for a generic...
Read more >
