Are GPU-enabled containers runnable with the containerd runtime?
GPU-enabled pods fail to start when using the gpu addon while staying with containerd (i.e., without switching the default runtime to docker). The cuda-vector-add test pod remains in the Pending state and never starts. The logs of nvidia-device-plugin-daemonset also show errors, stating that the default runtime needs to be changed from containerd to docker.
I would prefer to stay with the containerd runtime and avoid using docker-ce as the runtime (nvidia-docker2 depends on docker-ce), due to other aspects: docker changes iptables rules, which requires further workarounds (https://github.com/kubernetes/kubernetes/issues/39823#issuecomment-276841124 and https://github.com/ubuntu/microk8s/pull/267).
Seeing reports that k3s is able to run GPU-enabled pods using containerd (https://dev.to/mweibel/add-nvidia-gpu-support-to-k3s-with-containerd-4j17), and given that my OS-level containerd is able to run a pod with nvidia-smi, I would prefer to stay with the containerd runtime. Is that somehow possible?
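For reference, the approach in the k3s article above boils down to pointing containerd's CRI plugin at NVIDIA's runc wrapper. Roughly, the config.toml fragment looks like this (a sketch only; section and field names differ between containerd versions, and the binary path is an assumption):

```toml
# Fragment of containerd's config.toml (containerd 1.2/1.3-era, runtime v1 shim).
[plugins.cri.containerd.default_runtime]
  runtime_type = "io.containerd.runtime.v1.linux"
  # NVIDIA's runc wrapper runs the prestart hook that injects driver libraries.
  runtime_engine = "/usr/bin/nvidia-container-runtime"
```

After editing the config, containerd needs a restart for the new default runtime to take effect.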
The system on which microk8s runs is Debian Buster 10.4, with NVIDIA drivers from Debian backports and the NVIDIA container libraries from nvidia.github.io. microk8s was installed via snap.
# ldconfig -p | grep cuda
libicudata.so.65 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.65
libicudata.so.63 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.63
libcudart.so.10.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudart.so.10.1
libcudart.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudart.so
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
# microk8s enable gpu
Enabling NVIDIA GPU
NVIDIA kernel module detected
dns is already enabled
Applying manifest
daemonset.apps/nvidia-device-plugin-daemonset created
NVIDIA is enabled
# microk8s status
microk8s is running
addons:
dashboard: enabled
dns: enabled
gpu: enabled
metallb: enabled
rbac: enabled
registry: enabled
storage: enabled
cilium: disabled
[...]
# cat cuda-vector-add_test.yaml
apiVersion: v1
kind: Pod
metadata:
name: cuda-vector-add
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
image: "k8s.gcr.io/cuda-vector-add:v0.1"
resources:
limits:
nvidia.com/gpu: 1
# kubectl create -f cuda-vector-add_test.yaml
pod/cuda-vector-add created
# kubectl get all -A | grep cuda
default pod/cuda-vector-add 0/1 Pending 0 47s
# kubectl describe pod/cuda-vector-add
Name: cuda-vector-add
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-vector-add:
Image: k8s.gcr.io/cuda-vector-add:v0.1
Port: <none>
Host Port: <none>
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-s9mfz (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-s9mfz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-s9mfz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2s (x5 over 4m4s) default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
# kubectl -n kube-system logs pod/nvidia-device-plugin-daemonset-slqpv
2020/05/29 07:59:50 Loading NVML
2020/05/29 07:59:50 Failed to initialize NVML: could not load NVML library.
2020/05/29 07:59:50 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/05/29 07:59:50 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/29 07:59:50 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
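The "could not load NVML" message above means the plugin container could not load libnvidia-ml.so.1 at runtime, i.e., the container runtime did not inject the driver libraries. A quick host-side sanity check (a sketch; it only assumes the driver library is registered with the dynamic linker):

```shell
# If the NVIDIA driver is installed, libnvidia-ml should be resolvable on the
# host; inside the plugin pod it must additionally be injected by the runtime.
ldconfig -p | grep -i libnvidia-ml || echo "libnvidia-ml.so not in ldconfig cache"
```

If the library resolves on the host but the plugin still fails, the runtime handing containers to the pod is not the NVIDIA-aware one.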
Looking further into where this may be rooted, I used microk8s.ctr to start the pod directly and compared it with another ctr/containerd. microk8s.ctr, using the containerd runtime "nvidia-container-runtime", throws a libnvidia-container.so.1 relocation error. In contrast, everything works fine when doing the same with "ctr" directly (a different ctr/containerd outside microk8s; the 1.2.5 deb from Docker is used there).
# microk8s ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
ctr: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 127, stdout: , stderr: /usr/bin/nvidia-container-cli: relocation error: /usr/bin/nvidia-container-cli: symbol nvc_device_mig_caps_mount version NVC_1.0 not defined in file libnvidia-container.so.1 with link time reference\\\\n\\\"\"": unknown
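A relocation error like the one above usually means /usr/bin/nvidia-container-cli was built against a newer libnvidia-container than the one the dynamic linker resolves (the missing NVC_1.0 symbol suggests a stale library). A rough way to compare versions (a sketch; the package name globs are assumptions):

```shell
# List installed nvidia-container packages and the library the linker sees;
# mismatched versions between the CLI and libnvidia-container cause the
# "symbol ... not defined" relocation error.
dpkg -l 'libnvidia-container*' 'nvidia-container*' 2>/dev/null | grep '^ii' \
  || echo "nvidia-container packages not found"
ldconfig -p | grep libnvidia-container || echo "libnvidia-container not in cache"
```

If the versions disagree, upgrading libnvidia-container1 to match nvidia-container-cli (or vice versa) should clear the relocation error.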
Is the above error related to the prerequisites mentioned at https://github.com/NVIDIA/k8s-device-plugin#prerequisites, or would GPU-enabled pods be runnable in the given setup?
The prerequisites there note that you need to install the nvidia-docker2 package, and not nvidia-container-toolkit, because the new --gpus option hasn't reached Kubernetes yet, and that you need to set the nvidia runtime as the default runtime on your node.
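For the docker path those prerequisites describe, the usual change is the default runtime in /etc/docker/daemon.json (this mirrors the nvidia-docker2 documentation; it does not apply if you stay on containerd):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Docker needs a restart afterwards for the default runtime to change.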
The pod starts fine using the non-microk8s ctr/containerd:
# ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
Wed May 27 13:25:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 On | 00000000:07:00.0 Off | N/A |
| 0% 44C P8 18W / 160W | 0MiB / 5934MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
# cat /etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/debian10/$(ARCH) /
# apt show libcuda1
Package: libcuda1
Version: 440.82-1~bpo10+1
Priority: optional
Section: non-free/libs
Source: nvidia-graphics-drivers
Maintainer: Debian NVIDIA Maintainers <pkg-nvidia-devel@lists.alioth.debian.org>
Installed-Size: 17.0 MB
Provides: libcuda-10.0-1, libcuda-10.1-1, libcuda-10.2-1, libcuda-5.0-1, libcuda-5.5-1, libcuda-6.0-1, libcuda-6.5-1, libcuda-7.0-1, libcuda-7.5-1, libcuda-8.0-1, libcuda-9.0-1, libcuda-9.1-1, libcuda-9.2-1, libcuda.so.1 (= 440.82), libcuda1-any
Pre-Depends: nvidia-legacy-check (>= 396)
Depends: nvidia-support, nvidia-alternative (= 440.82-1~bpo10+1), libnvidia-fatbinaryloader (= 440.82-1~bpo10+1), libc6 (>= 2.7)
Recommends: nvidia-kernel-dkms (= 440.82-1~bpo10+1) | nvidia-kernel-440.82, nvidia-smi, libnvidia-cfg1 (= 440.82-1~bpo10+1), nvidia-persistenced, libcuda1-i386 (= 440.82-1~bpo10+1)
Suggests: nvidia-cuda-mps, nvidia-kernel-dkms (>= 440.82) | nvidia-kernel-source (>= 440.82)
Homepage: https://www.nvidia.com/CUDA
Download-Size: 2,295 kB
APT-Manual-Installed: yes
APT-Sources: http://deb.debian.org/debian buster-backports/non-free amd64 Packages
Description: NVIDIA CUDA Driver Library
# apt show nvidia-container-runtime
Package: nvidia-container-runtime
Version: 3.2.0-1
Priority: optional
Section: utils
Maintainer: NVIDIA CORPORATION <cudatools@nvidia.com>
Installed-Size: 2,021 kB
Depends: nvidia-container-toolkit (>= 1.1.0), nvidia-container-toolkit (<< 2.0.0), libseccomp2
Homepage: https://github.com/NVIDIA/nvidia-container-runtime/wiki
Download-Size: 612 kB
APT-Manual-Installed: yes
APT-Sources: https://nvidia.github.io/nvidia-container-runtime/debian10/amd64 Packages
Description: NVIDIA container runtime
Provides a modified version of runc allowing users to run GPU enabled
containers.
Top GitHub Comments
As far as I know (there's a good chance I'm wrong), microk8s packages the NVIDIA libs itself, as shown in its snapcraft.yaml. These are the libs:
Could they be conflicting with the libs that are installed on the system?
microk8s is in its default install, all standard. It was installed as described in the previous post, with the standard install command:
snap install microk8s --classic --channel=1.18/stable
My /var/snap/microk8s/current/args/kubelet then has this config, which is what I meant by "set to use the remote/containerd sock":
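For context, the relevant lines in that kubelet args file look roughly like this on a 1.18 snap install (reproduced from memory; treat the exact values as assumptions):

```
--container-runtime=remote
--container-runtime-endpoint=${SNAP_COMMON}/run/containerd.sock
```

That is, microk8s's kubelet already talks to its bundled containerd over the CRI socket rather than to a docker shim.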