NVIDIA GPU does not work on microk8s Kubernetes (host has NVIDIA driver)
Results of tests
The NVIDIA GPU works on bare metal, since the host OS has the NVIDIA driver, but it does not work in microk8s even with helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false
All details are in the attached [inspection-report-20211007_222539.tar.gz](url)
alex@pop-os:~/kubeflow/manifests$ nvidia-smi
Thu Oct 7 17:03:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 38C P8 9W / N/A | 393MiB / 5946MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1633 G /usr/lib/xorg/Xorg 248MiB |
| 0 N/A N/A 2240 G /usr/bin/gnome-shell 69MiB |
| 0 N/A N/A 3831278 G ...AAAAAAAAA= --shared-files 72MiB |
+-----------------------------------------------------------------------------+
Installed nvidia-container-runtime for Docker / nvidia-docker2 (a modified version of runc) plus the Docker config setting to use it instead of runc (rough install sketch below):
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
Also changed containerd to use nvidia-container-runtime as per https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#bare-metal-passthrough-with-with-pre-installed-nvidia-drivers
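For reference, on Ubuntu/Pop!_OS the package side of the two guides above amounts to roughly the following (repository setup omitted, see the linked install guide); treat this as a sketch of the usual steps, not a verbatim copy of either guide:
# install the NVIDIA container runtime packages for Docker (pulls in nvidia-container-runtime)
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# restart Docker so it picks up the new runtime
sudo systemctl restart docker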
Containerd config
alex@pop-os:~$ cat /etc/containerd/config.toml
#disabled_plugins = ["cri"]
#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0
#[grpc]
# address = "/run/containerd/containerd.sock"
# uid = 0
# gid = 0
#[debug]
# address = "/run/containerd/debug.sock"
# uid = 0
# gid = 0
# level = "info"
privileged_without_host_devices = false
base_runtime_spec = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v1"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
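After editing /etc/containerd/config.toml the daemon needs a restart for the nvidia runtime entry to take effect; a minimal sketch, assuming a systemd-managed containerd:
# restart containerd so the [plugins...runtimes.nvidia] section is loaded
sudo systemctl restart containerd
# sanity check that the binary referenced by BinaryName above exists
ls -l /usr/bin/nvidia-container-runtime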
Docker config
alex@pop-os:~$ cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"args": [],
"path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
},
"nvidia-experimental": {
"args": [],
"path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
}
}
}
alex@pop-os:~$ sudo nvidia-container-cli --load-kmods info
NVRM version: 470.63.01
CUDA version: 11.4
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3060 Laptop GPU
Brand: GeForce
GPU UUID: GPU-3d4037fa-1de1-b359-c959-bfb3d9ecbe50
Bus Location: 00000000:01:00.0
Architecture: 8.6
alex@pop-os:~$ sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
Thu Oct 7 13:10:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 40C P8 8W / N/A | 337MiB / 5946MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
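Note that microk8s bundles its own containerd, so the ctr test above exercises the host containerd rather than the one microk8s uses. A hedged equivalent against the bundled instance, assuming the microk8s ctr wrapper behaves like plain ctr, would be:
# same GPU smoke test, but through the containerd shipped with microk8s
microk8s ctr image pull docker.io/nvidia/cuda:11.0-base
microk8s ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-test nvidia-smi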
---
NVIDIA GPU Works in Docker on Bare Metal
alex@pop-os:~$ sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.0-base
root@2ee8a12130d5:/# nvidia-smi
Thu Oct 7 13:10:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 40C P8 8W / N/A | 337MiB / 5946MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
NVIDIA NGC PyTorch container: https://developer.nvidia.com/blog/gpu-containers-runtime/
sudo docker run -it --runtime=nvidia --shm-size=1g -e NVIDIA_VISIBLE_DEVICES=0 --rm nvcr.io/nvidia/pytorch:21.09-py3
In workspace/examples/upstream/mnist: python main.py
Training also works.
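A quick sanity check inside the NGC container, using standard PyTorch calls (the expected device name is taken from the nvidia-container-cli output above):
# run inside the nvcr.io/nvidia/pytorch:21.09-py3 container
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# expected: True NVIDIA GeForce RTX 3060 Laptop GPU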
NVIDIA GPU Does Not Work on microk8s Kubernetes
Note: the host OS already has the NVIDIA driver.
microk8s enable gpu
default gpu-operator-node-feature-discovery-worker-qmgfx 1/1 Running 0 96s
default gpu-operator-node-feature-discovery-master-58d884d5cc-6kxtx 1/1 Running 0 96s
default gpu-operator-5f8b7c4f59-kq2qg 1/1 Running 0 96s
gpu-operator-resources nvidia-dcgm-fdt74 0/1 Init:0/1 0 21s
gpu-operator-resources nvidia-dcgm-exporter-8qp77 0/1 Init:0/1 0 21s
gpu-operator-resources gpu-feature-discovery-pl4xk 0/1 Init:0/1 0 21s
gpu-operator-resources nvidia-operator-validator-2z4gn 0/1 Init:0/4 0 20s
gpu-operator-resources nvidia-device-plugin-daemonset-zzqm2 0/1 Init:0/1 0 20s
gpu-operator-resources nvidia-container-toolkit-daemonset-mfndg 0/1 Init:0/1 0 21s
gpu-operator-resources nvidia-driver-daemonset-kbtcp 0/1 Init:CrashLoopBackOff 2 72s
Error
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-driver-daemonset-kbtcp -c k8s-driver-manager
nvidia driver module is already loaded with refcount 384
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/pop-os labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-phcnx condition met
Waiting for the container-toolkit to shutdown
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Unloading NVIDIA driver kernel modules...
nvidia_uvm 1048576 0
nvidia_drm 61440 5
nvidia_modeset 1196032 7 nvidia_drm
nvidia 35270656 384 nvidia_uvm,nvidia_modeset
drm_kms_helper 258048 2 amdgpu,nvidia_drm
drm 561152 14 gpu_sched,drm_kms_helper,nvidia,amdgpu,drm_ttm_helper,nvidia_drm,ttm
Could not unload NVIDIA driver kernel modules, driver is in use
Unable to cleanup driver modules, attempting again with node drain...
Draining node pop-os...
node/pop-os cordoned
DEPRECATED WARNING: Aborting the drain command in a list of nodes will be deprecated in v1.23.
The new behavior will make the drain command go through all nodes even if one or more nodes failed during the drain.
For now, users can try such experience via: --ignore-errors
error: unable to drain node "pop-os", aborting command...
There are pending nodes to be drained:
pop-os
error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/istiod-86457659bb-bmpkb, kubeflow-user-example-com/ml-pipeline-visualizationserver-6b44c6759f-vwxqw, kubeflow/ml-pipeline-scheduledworkflow-5db54d75c5-tgb4s, istio-system/istio-ingressgateway-79b665c95-jl8fr, istio-system/cluster-local-gateway-75cb7c6c88-5x4kh, kubeflow/metadata-writer-548bd879bb-ntm6f, kubeflow/ml-pipeline-ui-5bd8d6dc84-nwdbv, kubeflow/ml-pipeline-visualizationserver-8476b5c645-xc26z, knative-serving/networking-istio-6b88f745c-l58vh, kubeflow/minio-5b65df66c9-bwng9, kubeflow/ml-pipeline-viewer-crd-68fb5f4d58-9wq4j, kubeflow/mysql-f7b9b7dd4-shnpw, kubeflow/cache-server-6566dc7dbf-wx5kb, knative-serving/istio-webhook-578b6b7654-4cpt8, kubeflow/workflow-controller-5cbbb49bd8-plk5d, knative-serving/webhook-6fffdc4d78-8w5b7, knative-serving/autoscaler-5c648f7465-mfbxz, kubeflow/ml-pipeline-847f9d7f78-rg22w, kubeflow/tensorboard-controller-controller-manager-6b6dcc6b5b-zd9d2, knative-serving/activator-7476cc56d4-jtmsr, knative-serving/controller-57c545cbfb-x4kpn, kubeflow/metadata-grpc-deployment-6b5685488-phdlt, kubeflow/kfserving-models-web-app-67658874d7-ttjk2, kubeflow/cache-deployer-deployment-79fdf9c5c9-6ndcs, kubeflow-user-example-com/ml-pipeline-ui-artifact-5dd95d555b-ftb2r, kubeflow/ml-pipeline-persistenceagent-d6bdc77bd-qnnm7
Uncordoning node pop-os...
node/pop-os uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/pop-os labeled
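The drain fails on the generic "pods with local storage" guard, and the error message itself names the override flag. A manual drain, if one wanted to let k8s-driver-manager finish its cleanup, would look roughly like this (sketch only):
# force the drain that k8s-driver-manager could not complete, then bring the node back
kubectl drain pop-os --ignore-daemonsets --delete-emptydir-data --force
kubectl uncordon pop-os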
Workaround for the above: https://github.com/NVIDIA/gpu-operator/issues/126
microk8s disable gpu
Solution
Delete the existing GPU-operator namespace, and delete the CRD:
kubectl delete crd clusterpolicies.nvidia.com
And install it manually:
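(Assuming the NVIDIA Helm repository is not yet configured, the usual setup would be roughly:)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update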
helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false
NAME: gpu-operator-1633622171
LAST DEPLOYED: Thu Oct 7 21:26:26 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Still same error!
default gpu-operator-1633622171-node-feature-discovery-worker-622r7 1/1 Running 0 53s
default gpu-operator-1633622171-node-feature-discovery-master-7cb5g8fnl 1/1 Running 0 53s
default gpu-operator-5f8b7c4f59-8fc58 1/1 Running 0 53s
gpu-operator-resources gpu-feature-discovery-cmd8l 0/1 Init:0/1 0 29s
gpu-operator-resources nvidia-device-plugin-daemonset-ctg97 0/1 Init:0/1 0 30s
gpu-operator-resources nvidia-dcgm-vzwqf 0/1 Init:0/1 0 30s
gpu-operator-resources nvidia-dcgm-exporter-ns9bm 0/1 Init:0/1 0 30s
gpu-operator-resources nvidia-container-toolkit-daemonset-l9cbt 1/1 Running 0 31s
gpu-operator-resources nvidia-operator-validator-gmp5w 0/1 Init:Error 2 31s
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-gmp5w -c driver-validation
running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Oct 7 21:26:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 43C P0 22W / N/A | 396MiB / 5946MiB | 9% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2842 G /usr/lib/xorg/Xorg 227MiB |
| 0 N/A N/A 3529 G /usr/bin/gnome-shell 72MiB |
| 0 N/A N/A 51246 G ...AAAAAAAAA= --shared-files 94MiB |
+-----------------------------------------------------------------------------+
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-gmp5w -c toolkit-validation
toolkit is not ready
time="2021-10-07T16:02:33Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-gmp5w -c plugin-validation
Error from server (BadRequest): container "plugin-validation" in pod "nvidia-operator-validator-gmp5w" is waiting to start: PodInitializing
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-gmp5w -c cuda-validation
Error from server (BadRequest): container "cuda-validation" in pod "nvidia-operator-validator-gmp5w" is waiting to start: PodInitializing
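Given the "nvidia-smi": executable file not found in $PATH message from toolkit-validation, one hedged check is whether the container-toolkit daemonset actually placed its binaries on the host, under the path the Docker config above already points at:
# the toolkit daemonset installs its binaries here when it succeeds
ls -l /usr/local/nvidia/toolkit/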
Delete and try with the toolkit enabled
alex@pop-os:~$ helm list -aq
gpu-operator-1633622171
helm delete gpu-operator-1633622171
and install with toolkit.enabled set to true:
helm install --wait --generate-name nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=true
Still the same error:
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-spngp -c driver-validation
running command chroot with args [/run/nvidia/driver nvidia-smi]
Thu Oct 7 21:40:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| N/A 42C P8 11W / N/A | 406MiB / 5946MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2842 G /usr/lib/xorg/Xorg 227MiB |
| 0 N/A N/A 3529 G /usr/bin/gnome-shell 72MiB |
| 0 N/A N/A 51246 G ...AAAAAAAAA= --shared-files 103MiB |
+-----------------------------------------------------------------------------+
alex@pop-os:~$ kubectl -n gpu-operator-resources logs nvidia-operator-validator-spngp -c toolkit-validation
toolkit is not ready
time="2021-10-07T16:41:31Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
alex@pop-os:~$
Top GitHub Comments
Is there any way to install the gpu-operator on 1.21? I want to run it with Kubeflow, which only supports up to 1.21.
It worked perfectly with v1.22!
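For reference, a hedged example of installing microk8s from a specific channel, e.g. the 1.22 track mentioned above:
# fresh install on the 1.22 track
sudo snap install microk8s --classic --channel=1.22/stable
# or move an existing install to 1.22
sudo snap refresh microk8s --channel=1.22/stable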