
GPU is not utilised while running the Katib NAS example

See original GitHub issue

Hi, I am running the Katib NAS example and I have noticed that, although it runs, it is not able to utilise the GPU of the machine. So I went inside the pod and tried to call TensorFlow from the pod, and I got the following error:

2022-01-25 11:19:26.568346: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-01-25 11:19:26.569961: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
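
A quick way to see what TensorFlow can actually use from inside the trial pod, and which CUDA/cuDNN versions the installed wheel expects, is a short check like the sketch below. This assumes a TensorFlow 2.3+ image and is only a diagnostic, not part of the Katib example itself.

# Diagnostic sketch, assuming TensorFlow 2.3+ inside the trial pod; not part of the Katib example.
import os
import tensorflow as tf

# TensorFlow resolves CUDA shared objects (e.g. libcusolver.so.11) via these paths.
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))

# GPUs TensorFlow can actually see and use.
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))

# CUDA/cuDNN versions this TensorFlow build was compiled against; a mismatch with the
# libraries present in the image is a common cause of the dlopen warning above.
print("Build info:", tf.sysconfig.get_build_info())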

Please help with this

Thanks

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
d-gol commented, Feb 8, 2022

@ashissharma97, so you are using GCP. In our on-premise cluster, we have drivers installed on the nodes, so we mount them.

On the page you linked (https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus) there is a note: "NVIDIA GPU drivers: You must install NVIDIA GPU drivers by yourself on your Container-Optimized OS VM instances. This section explains how to install the drivers on Container-Optimized OS VM instances."

This means the drivers should be installed on the VMs (the Kubernetes nodes). It may be possible to install the drivers in a container, but I have never tested that myself, so I can't help much there.
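
If it is unclear whether those libraries are visible inside the trial container at all, one way to check is to try loading the exact library the TensorFlow warning names. A minimal sketch (plain Python, nothing Katib-specific; libcusolver.so.11 is taken from the error message above):

# Sketch: check whether the library from the warning can be found and dlopen'd inside the pod.
import ctypes
import glob
import os

# Directories TensorFlow is told to search, per the LD_LIBRARY_PATH shown in the error message.
for d in filter(None, os.environ.get("LD_LIBRARY_PATH", "").split(":")):
    print(d, "->", glob.glob(os.path.join(d, "libcusolver.so*")) or "no libcusolver here")

# Try to dlopen the library directly; success means it is present and loadable, and the
# problem is more likely a CUDA/TensorFlow version mismatch than a missing mount.
try:
    ctypes.CDLL("libcusolver.so.11")
    print("libcusolver.so.11 loaded successfully")
except OSError as err:
    print("failed to load libcusolver.so.11:", err)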

Maybe the issue is with CUDA/TensorFlow compatibility, or with the CUDA installation path. These threads might help if that's the issue:

0 reactions
ashissharma97 commented, Feb 8, 2022

No problem @d-gol. By the way, thanks for your help; I'll surely check the links you have shared.

Read more comments on GitHub >

Top Results From Across the Web

Running an Experiment - Kubeflow
This guide describes how to configure and run a Katib experiment. The experiment can perform hyperparameter tuning or a neural architecture ...
Read more >
Better ML models with Katib - Towards Data Science
Experiment with HP tuning on their local machine, using a sample of the ... NAS: Katib is one of only two frameworks that...
Read more >
End-to-End Hyperparameter Tuning with Katib, Tensorflow ...
When it comes to scalable, GPU accelerated, Machine Learning applications, not everybody has the luxury of bursting into a massive ...
Read more >
Accelerating ETL on KubeFlow with RAPIDS
Using RAPIDS on your KubeFlow cluster empowers you to GPU-accelerate your ETL work in both your interactive sessions and ETL pipelines.
Read more >
Deployment of ML Models using Kubeflow on Different Cloud ...
While running ML models on developmental ... ways, Airflow was not built with Kubernetes in mind and is more useful for a generic...
Read more >
