Are GPU-enabled containers runnable with the containerd runtime?
GPU-enabled pods fail to start when using the gpu addon while staying with containerd (i.e., without switching the default runtime to docker). The cuda-vector-add test pod remains in the Pending state and never starts. The logs of nvidia-device-plugin-daemonset also show errors, stating that the default runtime needs to be changed from containerd to docker.
I would prefer to stay with the containerd runtime and avoid using docker-ce as the runtime (nvidia-docker2 depends on docker-ce), due to other aspects: docker changes iptables rules, which requires further workarounds (https://github.com/kubernetes/kubernetes/issues/39823#issuecomment-276841124 and https://github.com/ubuntu/microk8s/pull/267).
Seeing reports that k3s is able to run GPU-enabled pods using containerd (https://dev.to/mweibel/add-nvidia-gpu-support-to-k3s-with-containerd-4j17), and given that my OS-level containerd is able to run a pod with nvidia-smi, I would prefer to stay with the containerd runtime. Is that somehow possible?
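For reference, the approach in the k3s article above boils down to pointing containerd's CRI plugin at NVIDIA's runc wrapper. Roughly, the config.toml fragment looks like this (a sketch only; section and field names differ between containerd versions, and the binary path is an assumption):

```toml
# Fragment of containerd's config.toml (containerd 1.2/1.3-era, runtime v1 shim).
[plugins.cri.containerd.default_runtime]
  runtime_type = "io.containerd.runtime.v1.linux"
  # NVIDIA's runc wrapper runs the prestart hook that injects driver libraries.
  runtime_engine = "/usr/bin/nvidia-container-runtime"
```

After editing the config, containerd needs a restart for the new default runtime to take effect.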
The system on which microk8s runs is Debian Buster 10.4, with NVIDIA drivers from Debian backports and the NVIDIA container libraries from nvidia.github.io. microk8s was installed via snap.
# ldconfig -p | grep cuda
libicudata.so.65 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.65
libicudata.so.63 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.63
libcudart.so.10.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudart.so.10.1
libcudart.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudart.so
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
# microk8s enable gpu
Enabling NVIDIA GPU
NVIDIA kernel module detected
dns is already enabled
Applying manifest
daemonset.apps/nvidia-device-plugin-daemonset created
NVIDIA is enabled
# microk8s status
microk8s is running
addons:
dashboard: enabled
dns: enabled
gpu: enabled
metallb: enabled
rbac: enabled
registry: enabled
storage: enabled
cilium: disabled
[...]
# cat cuda-vector-add_test.yaml
apiVersion: v1
kind: Pod
metadata:
name: cuda-vector-add
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1
# kubectl create -f cuda-vector-add_test.yaml
pod/cuda-vector-add created
# kubectl get all -A | grep cuda
default pod/cuda-vector-add 0/1 Pending 0 47s
# kubectl describe pod/cuda-vector-add
Name: cuda-vector-add
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-vector-add:
Image: k8s.gcr.io/cuda-vector-add:v0.1
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-s9mfz (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-s9mfz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-s9mfz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2s (x5 over 4m4s) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
# kubectl -n kube-system logs pod/nvidia-device-plugin-daemonset-slqpv
2020/05/29 07:59:50 Loading NVML
2020/05/29 07:59:50 Failed to initialize NVML: could not load NVML library.
2020/05/29 07:59:50 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/05/29 07:59:50 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/29 07:59:50 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
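The "could not load NVML" message above means the plugin container could not load libnvidia-ml.so.1 at runtime, i.e., the container runtime did not inject the driver libraries. A quick host-side sanity check (a sketch; it only assumes the driver library is registered with the dynamic linker):

```shell
# If the NVIDIA driver is installed, libnvidia-ml should be resolvable on the
# host; inside the plugin pod it must additionally be injected by the runtime.
ldconfig -p | grep -i libnvidia-ml || echo "libnvidia-ml.so not in ldconfig cache"
```

If the library resolves on the host but the plugin still fails, the runtime handing containers to the pod is not the NVIDIA-aware one.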
Looking further into where this may be rooted, I used microk8s.ctr to start the pod directly and compared it with another ctr/containerd. microk8s.ctr, using the containerd runtime "nvidia-container-runtime", throws a libnvidia-container.so.1 relocation error. In contrast, everything works fine when doing the same with "ctr" directly (a different ctr/containerd outside microk8s; the 1.2.5 deb from Docker is used there).
# microk8s ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
ctr: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 127, stdout: , stderr: /usr/bin/nvidia-container-cli: relocation error: /usr/bin/nvidia-container-cli: symbol nvc_device_mig_caps_mount version NVC_1.0 not defined in file libnvidia-container.so.1 with link time reference\\\\n\\\"\"": unknown
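A relocation error like the one above usually means /usr/bin/nvidia-container-cli was built against a newer libnvidia-container than the one the dynamic linker resolves (the missing NVC_1.0 symbol suggests a stale library). A rough way to compare versions (a sketch; the package name globs are assumptions):

```shell
# List installed nvidia-container packages and the library the linker sees;
# mismatched versions between the CLI and libnvidia-container cause the
# "symbol ... not defined" relocation error.
dpkg -l 'libnvidia-container*' 'nvidia-container*' 2>/dev/null | grep '^ii' \
  || echo "nvidia-container packages not found"
ldconfig -p | grep libnvidia-container || echo "libnvidia-container not in cache"
```

If the versions disagree, upgrading libnvidia-container1 to match nvidia-container-cli (or vice versa) should clear the relocation error.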
Is the above error related to the prerequisites mentioned at https://github.com/NVIDIA/k8s-device-plugin#prerequisites, or would GPU-enabled pods be runnable in the given setup?
The prerequisites there note that you need to install the nvidia-docker2 package, and not nvidia-container-toolkit, because the new --gpus option hasn't reached Kubernetes yet, and that you need to set the nvidia runtime as the default runtime on your node.
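For the docker path those prerequisites describe, the usual change is the default runtime in /etc/docker/daemon.json (this mirrors the nvidia-docker2 documentation; it does not apply if you stay on containerd):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Docker needs a restart afterwards for the default runtime to change.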
The pod starts fine using the non-microk8s ctr/containerd:
# ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
Wed May 27 13:25:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 On | 00000000:07:00.0 Off | N/A |
| 0% 44C P8 18W / 160W | 0MiB / 5934MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# cat /etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/debian10/$(ARCH) /
# apt show libcuda1
Package: libcuda1
Version: 440.82-1~bpo10+1
Priority: optional
Section: non-free/libs
Source: nvidia-graphics-drivers
Maintainer: Debian NVIDIA Maintainers <pkg-nvidia-devel@lists.alioth.debian.org>
Installed-Size: 17.0 MB
Provides: libcuda-10.0-1, libcuda-10.1-1, libcuda-10.2-1, libcuda-5.0-1, libcuda-5.5-1, libcuda-6.0-1, libcuda-6.5-1, libcuda-7.0-1, libcuda-7.5-1, libcuda-8.0-1, libcuda-9.0-1, libcuda-9.1-1, libcuda-9.2-1, libcuda.so.1 (= 440.82), libcuda1-any
Pre-Depends: nvidia-legacy-check (>= 396)
Depends: nvidia-support, nvidia-alternative (= 440.82-1~bpo10+1), libnvidia-fatbinaryloader (= 440.82-1~bpo10+1), libc6 (>= 2.7)
Recommends: nvidia-kernel-dkms (= 440.82-1~bpo10+1) | nvidia-kernel-440.82, nvidia-smi, libnvidia-cfg1 (= 440.82-1~bpo10+1), nvidia-persistenced, libcuda1-i386 (= 440.82-1~bpo10+1)
Suggests: nvidia-cuda-mps, nvidia-kernel-dkms (>= 440.82) | nvidia-kernel-source (>= 440.82)
Homepage: https://www.nvidia.com/CUDA
Download-Size: 2,295 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian buster-backports/non-free amd64 Packages
Description: NVIDIA CUDA Driver Library
# apt show nvidia-container-runtime
Package: nvidia-container-runtime
Version: 3.2.0-1
Priority: optional
Section: utils
Maintainer: NVIDIA CORPORATION <cudatools@nvidia.com>
Installed-Size: 2,021 kB
Depends: nvidia-container-toolkit (>= 1.1.0), nvidia-container-toolkit (<< 2.0.0), libseccomp2
Homepage: https://github.com/NVIDIA/nvidia-container-runtime/wiki
Download-Size: 612 kB
APT-Manual-Installed: yes
APT-Sources: https://nvidia.github.io/nvidia-container-runtime/debian10/amd64 Packages
Description: NVIDIA container runtime
Provides a modified version of runc allowing users to run GPU enabled
containers.
Top GitHub Comments
As far as I know (there's a good chance I'm wrong), microk8s packages the NVIDIA libs itself, as shown in its snapcraft.yaml. These are the libs:
Could they be conflicting with the libs that are installed on the system?
microk8s is in its default install, all standard. It was installed as described in the previous post, with the standard install command:
snap install microk8s --classic --channel=1.18/stable
My /var/snap/microk8s/current/args/kubelet then has this config, which is what I meant by "set to use the remote/containerd sock":
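For context, the relevant lines in that kubelet args file look roughly like this on a 1.18 snap install (reproduced from memory; treat the exact values as assumptions):

```
--container-runtime=remote
--container-runtime-endpoint=${SNAP_COMMON}/run/containerd.sock
```

That is, microk8s's kubelet already talks to its bundled containerd over the CRI socket rather than to a docker shim.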