
NVIDIA GPU does not work on MicroK8s Kubernetes (host has NVIDIA driver)

See original GitHub issue

Results of tests

The NVIDIA GPU works on bare metal, since the host OS has the NVIDIA driver, but it does not work in MicroK8s, even with helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false.

All details are in the attached inspection report: inspection-report-20211007_222539.tar.gz

alex@pop-os:~/kubeflow/manifests$ nvidia-smi
Thu Oct  7 17:03:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   38C    P8     9W /  N/A |    393MiB /  5946MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1633      G   /usr/lib/xorg/Xorg                248MiB |
|    0   N/A  N/A      2240      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A   3831278      G   ...AAAAAAAAA= --shared-files       72MiB |
+-----------------------------------------------------------------------------+

Installed nvidia-container-runtime for Docker / nvidia-docker2 (a modified version of runc), plus the Docker config setting to use it instead of runc.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
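For reference, the Docker-side install followed the linked toolkit guide. The sketch below is an approximation of those steps for an Ubuntu-based host, not a verified transcript of what was run; the repository URL and the $distribution value should be taken from the guide itself (Pop!_OS may need the matching Ubuntu distribution string).

# Add the NVIDIA container toolkit apt repository (check the guide for the current URL)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
    | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-docker2 (pulls in nvidia-container-runtime) and restart Docker
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker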

Also changed containerd to use nvidia-container-runtime as per https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-with-pre-installed-nvidia-drivers

Containerd config

alex@pop-os:~$ cat  /etc/containerd/config.toml

#disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"
privileged_without_host_devices = false
base_runtime_spec = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
     privileged_without_host_devices = false
     runtime_engine = ""
     runtime_root = ""
     runtime_type = "io.containerd.runc.v1"
     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
         BinaryName = "/usr/bin/nvidia-container-runtime"
         SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"

Docker config

alex@pop-os:~$ cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
        }
    }
}
alex@pop-os:~$ sudo nvidia-container-cli --load-kmods info
NVRM version:   470.63.01
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3060 Laptop GPU
Brand:          GeForce
GPU UUID:       GPU-3d4037fa-1de1-b359-c959-bfb3d9ecbe50
Bus Location:   00000000:01:00.0
Architecture:   8.6
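As an extra sanity check (not part of the original report), Docker's view of the registered runtimes can be inspected; the nvidia and nvidia-experimental entries from daemon.json above should show up here:

# Confirm Docker picked up the runtimes from /etc/docker/daemon.json
sudo docker info | grep -i runtime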

alex@pop-os:~$ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
Thu Oct  7 13:10:18 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P8     8W /  N/A |    337MiB /  5946MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
---

NVIDIA GPU works with Docker on bare metal

alex@pop-os:~$ sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.0-base
root@2ee8a12130d5:/# nvidia-smi
Thu Oct  7 13:10:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P8     8W /  N/A |    337MiB /  5946MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

NVIDIA NGC PyTorch (see https://developer.nvidia.com/blog/gpu-containers-runtime/)

sudo docker run -it --runtime=nvidia --shm-size=1g -e NVIDIA_VISIBLE_DEVICES=0 --rm nvcr.io/nvidia/pytorch:21.09-py3

Inside the container, running python main.py under workspace/examples/upstream/mnist works; training also works.
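A quick hypothetical check (not from the original report) to confirm the container sees the GPU before kicking off training:

# Inside the nvcr.io/nvidia/pytorch container: verify CUDA is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"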

NVIDIA GPU does not work on MicroK8s Kubernetes

Note: the host OS already has the NVIDIA driver installed.

microk8s enable gpu

default                     gpu-operator-node-feature-discovery-worker-qmgfx              1/1     Running                 0          96s
default                     gpu-operator-node-feature-discovery-master-58d884d5cc-6kxtx   1/1     Running                 0          96s
default                     gpu-operator-5f8b7c4f59-kq2qg                                 1/1     Running                 0          96s
gpu-operator-resources      nvidia-dcgm-fdt74                                             0/1     Init:0/1                0          21s
gpu-operator-resources      nvidia-dcgm-exporter-8qp77                                    0/1     Init:0/1                0          21s
gpu-operator-resources      gpu-feature-discovery-pl4xk                                   0/1     Init:0/1                0          21s
gpu-operator-resources      nvidia-operator-validator-2z4gn                               0/1     Init:0/4                0          20s
gpu-operator-resources      nvidia-device-plugin-daemonset-zzqm2                          0/1     Init:0/1                0          20s
gpu-operator-resources      nvidia-container-toolkit-daemonset-mfndg                      0/1     Init:0/1                0          21s
gpu-operator-resources      nvidia-driver-daemonset-kbtcp                                 0/1     Init:CrashLoopBackOff   2          72s

Error

alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-driver-daemonset-kbtcp -c k8s-driver-manager
nvidia driver module is already loaded with refcount 384
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/pop-os labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-phcnx condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_uvm           1048576  0
nvidia_drm             61440  5
nvidia_modeset       1196032  7 nvidia_drm
nvidia              35270656  384 nvidia_uvm,nvidia_modeset
drm_kms_helper        258048  2 amdgpu,nvidia_drm
drm                   561152  14 gpu_sched,drm_kms_helper,nvidia,amdgpu,drm_ttm_helper,nvidia_drm,ttm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node pop-os...
node/pop-os cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "pop-os", aborting command...

There are pending nodes to be drained:
 pop-os
error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istiod-86457659bb-bmpkb, kubeflow-user-example-com/ml-pipeline-visualizationserver-6b44c6759f-vwxqw, kubeflow/ml-pipeline-scheduledworkflow-5db54d75c5-tgb4s, istio-system/istio-ingressgateway-79b665c95-jl8fr, istio-system/cluster-local-gateway-75cb7c6c88-5x4kh, kubeflow/metadata-writer-548bd879bb-ntm6f, kubeflow/ml-pipeline-ui-5bd8d6dc84-nwdbv, kubeflow/ml-pipeline-visualizationserver-8476b5c645-xc26z, knative-serving/networking-istio-6b88f745c-l58vh, kubeflow/minio-5b65df66c9-bwng9, kubeflow/ml-pipeline-viewer-crd-68fb5f4d58-9wq4j, kubeflow/mysql-f7b9b7dd4-shnpw, kubeflow/cache-server-6566dc7dbf-wx5kb, knative-serving/istio-webhook-578b6b7654-4cpt8, kubeflow/workflow-controller-5cbbb49bd8-plk5d, knative-serving/webhook-6fffdc4d78-8w5b7, knative-serving/autoscaler-5c648f7465-mfbxz, kubeflow/ml-pipeline-847f9d7f78-rg22w, kubeflow/tensorboard-controller-controller-manager-6b6dcc6b5b-zd9d2, knative-serving/activator-7476cc56d4-jtmsr, knative-serving/controller-57c545cbfb-x4kpn, kubeflow/metadata-grpc-deployment-6b5685488-phdlt, kubeflow/kfserving-models-web-app-67658874d7-ttjk2, kubeflow/cache-deployer-deployment-79fdf9c5c9-6ndcs, kubeflow-user-example-com/ml-pipeline-ui-artifact-5dd95d555b-ftb2r, kubeflow/ml-pipeline-persistenceagent-d6bdc77bd-qnnm7
Uncordoning node pop-os...
node/pop-os uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/pop-os labeled

Workaround for the above: https://github.com/NVIDIA/gpu-operator/issues/126

microk8s disable gpu

Solution

  1. Delete the existing GPU operator resources: kubectl delete crd clusterpolicies.nvidia.com

  2. Follow https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-with-pre-installed-nvidia-drivers

Then install the operator manually:

helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false
NAME: gpu-operator-1633622171
LAST DEPLOYED: Thu Oct  7 21:26:26 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Still the same error!

default                     gpu-operator-1633622171-node-feature-discovery-worker-622r7       1/1     Running      0          53s
default                     gpu-operator-1633622171-node-feature-discovery-master-7cb5g8fnl   1/1     Running      0          53s
default                     gpu-operator-5f8b7c4f59-8fc58                                     1/1     Running      0          53s
gpu-operator-resources      gpu-feature-discovery-cmd8l                                       0/1     Init:0/1     0          29s
gpu-operator-resources      nvidia-device-plugin-daemonset-ctg97                              0/1     Init:0/1     0          30s
gpu-operator-resources      nvidia-dcgm-vzwqf                                                 0/1     Init:0/1     0          30s
gpu-operator-resources      nvidia-dcgm-exporter-ns9bm                                        0/1     Init:0/1     0          30s
gpu-operator-resources      nvidia-container-toolkit-daemonset-l9cbt                          1/1     Running      0          31s
gpu-operator-resources      nvidia-operator-validator-gmp5w                                   0/1     Init:Error   2          31s
alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-gmp5w -c driver-validation
running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Oct  7 21:26:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   43C    P0    22W /  N/A |    396MiB /  5946MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2842      G   /usr/lib/xorg/Xorg                227MiB |
|    0   N/A  N/A      3529      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A     51246      G   ...AAAAAAAAA= --shared-files       94MiB |
+-----------------------------------------------------------------------------+
alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-gmp5w -c toolkit-validation
toolkit is not ready
time="2021-10-07T16:02:33Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-gmp5w -c plugin-validation
Error from server (BadRequest): container "plugin-validation" in pod "nvidia-operator-validator-gmp5w" is waiting to start: PodInitializing
alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-gmp5w -c cuda-validation
Error from server (BadRequest): container "cuda-validation" in pod "nvidia-operator-validator-gmp5w" is waiting to start: PodInitializing
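One way to dig into this kind of toolkit-validation failure (a hedged diagnostic sketch; the paths below are assumptions based on the usual microk8s snap layout and may differ between releases) is to check where the toolkit daemonset dropped its binaries on the host and which containerd config MicroK8s is actually using. Note that MicroK8s ships its own containerd inside the snap, so edits to /etc/containerd/config.toml affect the system containerd rather than the one MicroK8s runs.

# Binaries the gpu-operator toolkit daemonset normally installs on the host
ls -l /usr/local/nvidia/toolkit/

# MicroK8s uses its own containerd config under the snap, not /etc/containerd/config.toml;
# check whether it references the nvidia runtime at all
grep -i nvidia /var/snap/microk8s/current/args/containerd-template.toml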

Delete and try with the toolkit enabled

alex@pop-os:~$ helm list -aq
gpu-operator-1633622171

helm delete gpu-operator-1633622171

and install with toolkit.enabled set to true

helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=true

Still the same error:

alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-spngp -c driver-validation
running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Oct  7 21:40:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   42C    P8    11W /  N/A |    406MiB /  5946MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2842      G   /usr/lib/xorg/Xorg                227MiB |
|    0   N/A  N/A      3529      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A     51246      G   ...AAAAAAAAA= --shared-files      103MiB |
+-----------------------------------------------------------------------------+
alex@pop-os:~$ kubectl -n gpu-operator-resources logs  nvidia-operator-validator-spngp -c toolkit-validation
toolkit is not ready
time="2021-10-07T16:41:31Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
alex@pop-os:~$ 

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

RW21 commented, Apr 20, 2022 (3 reactions)

Is there any way to install the gpu-operator on 1.21? I want to run it with Kubeflow, which only supports Kubernetes up to 1.21.

gigony commented, Apr 11, 2022 (3 reactions)

It worked perfectly with v1.22!

# To prevent "Error: repository name (nvidia) already exists, please specify a different name" message when installing new microk8s
microk8s helm3 repo remove nvidia

sudo snap remove microk8s
sudo snap install microk8s --classic --channel=1.22
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
#su - $USER
newgrp microk8s

microk8s status --wait-ready
microk8s kubectl get nodes
microk8s kubectl get services

microk8s enable gpu
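Once the operator pods are healthy, a quick way to confirm the GPU is actually schedulable (a hypothetical smoke test, not part of the original comment) is to check the node's allocatable resources and run a one-off CUDA pod:

# The node should advertise the GPU resource
microk8s kubectl describe node | grep nvidia.com/gpu

# Throwaway pod that requests the GPU and just runs nvidia-smi
cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod has completed, the usual nvidia-smi table should appear in its logs
microk8s kubectl logs cuda-smoke-test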
