GPU is not utilising while running Katib NAS example
Hi, I am running the Katib NAS example and I noticed that, while it runs, it is not able to utilise the GPU on the machine. So I went inside the pod and tried to call TensorFlow from there, and I get the following error:
2022-01-25 11:19:26.568346: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-01-25 11:19:26.569961: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
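The error means the dynamic loader cannot resolve CUDA's libcusolver on the pod's library search path. Before digging into TensorFlow itself, a small stdlib-only helper (hypothetical, not part of Katib or TensorFlow) can show which CUDA shared libraries the loader can actually find inside the pod:

```python
# Hypothetical diagnostic helper: report which CUDA shared libraries
# the dynamic loader can resolve. Uses only the Python standard library,
# so it runs inside the pod even when TensorFlow fails to load its GPU stack.
from ctypes.util import find_library


def check_cuda_libs(names=("cudart", "cublas", "cusolver", "cudnn")):
    """Return a dict mapping each library name to its resolved soname/path, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, path in check_cuda_libs().items():
        print(f"lib{name}: {path if path else 'NOT FOUND'}")
```

If `cusolver` comes back as `NOT FOUND` while the driver directories in `LD_LIBRARY_PATH` exist, the CUDA toolkit libraries are missing from the image or the mounted driver path, which matches the `dlerror` above.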
Please help with this
Thanks
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 1
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@ashissharma97, so you are using GCP. In our on-premise cluster, we have drivers installed on the nodes, so we mount them.
On the link you sent (https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus) there is a note:
NVIDIA GPU drivers: You must install NVIDIA GPU drivers by yourself on your Container-Optimized OS VM instances. This section explains how to install the drivers on Container-Optimized OS VM instances.
meaning the drivers should be installed on the VMs (the Kubernetes nodes) themselves. It may be possible to install drivers inside a container, but I have never tested that myself, so I can't help much there.
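For completeness, once the drivers are installed on the nodes and the NVIDIA device plugin is running, a pod only gets the GPU (and the mounted driver libraries at paths like `/usr/local/nvidia`) if it explicitly requests one. A minimal sketch, assuming the standard `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin (image and pod names are placeholders):

```yaml
# Minimal pod spec requesting one GPU. Without this resource limit,
# the device plugin does not expose the GPU or driver mounts to the container.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check          # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: trial
      image: tensorflow/tensorflow:latest-gpu   # example image, adjust to your trial image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

In a Katib experiment, the same `resources.limits` block goes into the trial template's container spec, so each trial pod is scheduled onto a GPU node.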
Maybe the issue is with CUDA/Tensorflow compatibility, or CUDA installation path. These threads might help if that’s the issue:
No problem @d-gol. Thanks for your help, I'll surely check the links you shared.