
GPU addon stops working after upgrading to Ubuntu 21.10 which uses cgroup v2

See original GitHub issue

Please run microk8s inspect and attach the generated tarball to this issue.

$ k get all -n gpu-operator-resources
NAME                                           READY   STATUS                  RESTARTS       AGE
pod/nvidia-device-plugin-validator-6lg6f       0/1     Completed               0              8d
pod/nvidia-cuda-validator-8cmv6                0/1     Completed               0              8d
pod/nvidia-dcgm-exporter-7798m                 0/1     Init:CrashLoopBackOff   7 (106s ago)   9d
pod/nvidia-container-toolkit-daemonset-dlnq9   0/1     Init:CrashLoopBackOff   9 (98s ago)    9d
pod/nvidia-operator-validator-hl8rk            0/1     Init:CrashLoopBackOff   7 (99s ago)    9d
pod/nvidia-device-plugin-daemonset-rjdqf       0/1     Init:CrashLoopBackOff   9 (91s ago)    9d
pod/gpu-feature-discovery-hwq2s                0/1     Init:CrashLoopBackOff   9 (90s ago)    9d

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/nvidia-dcgm-exporter   ClusterIP   10.152.183.145   <none>        9400/TCP   9d

NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
daemonset.apps/nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             9d
daemonset.apps/nvidia-dcgm-exporter                 1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           9d
daemonset.apps/nvidia-operator-validator            1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      9d
daemonset.apps/nvidia-device-plugin-daemonset       1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           9d
daemonset.apps/nvidia-container-toolkit-daemonset   1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       9d
daemonset.apps/gpu-feature-discovery                1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   9d
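
Every failing pod is stuck in Init:CrashLoopBackOff, so the breakage is inside an init container, and its events (shown by the describe below) and its own logs are the places to look. A minimal sketch of pulling those logs; the exact init-container names vary by gpu-operator version, so list them first and substitute the crashing one for the placeholder:

# List the init containers defined on the failing pod.
$ k get pod nvidia-device-plugin-daemonset-rjdqf -n gpu-operator-resources \
    -o jsonpath='{.spec.initContainers[*].name}'

# Fetch the logs of the crash-looping init container reported above.
$ k logs nvidia-device-plugin-daemonset-rjdqf -n gpu-operator-resources -c <init-container-name>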
$ k describe pod nvidia-device-plugin-daemonset-rjdqf -n gpu-operator-resources
<snip>
  Warning  Failed          12m (x4 over 14m)     kubelet  Error: failed to create containerd task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: container error: cgroup subsystem devices not found: unknown

We appreciate your feedback. Thank you for using microk8s.

Attached: inspection-report-20211017_133715.tar.gz
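
The event above points at the root cause: nvidia-container-cli expects the cgroup v1 devices controller, which does not exist on a host booted with the unified cgroup v2 hierarchy that Ubuntu 21.10 defaults to. A quick way to confirm which hierarchy the node is running (the output shown is illustrative for a cgroup v2 host):

# cgroup2fs means the unified (v2) hierarchy is mounted at /sys/fs/cgroup;
# tmpfs would indicate the legacy v1 layout.
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

# Under cgroup v2 there is a single controllers list and no separate "devices"
# controller, which is exactly what the error message is complaining about.
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma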

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
AdamIsrael commented on Oct 25, 2021

Wouldn’t this hack break other applications which require cgroup v2?

Yes, the workaround disables cgroup v2 as the default, in favour of v1 (a sketch of the boot-parameter change is included after this comment). In my case, none of the other applications run against cgroup v2, so this was a safe change. This only applies if you’re purposely running a version of microk8s prior to 1.22.

It doesn’t sound like my comment applies to your issue; I just wanted to document it for anyone else who runs across this potentially breaking change.
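
For anyone looking for the concrete change behind the workaround described above: on a stock GRUB-based Ubuntu install it amounts to telling systemd to boot with the legacy hierarchy. This is a sketch only; the existing contents of GRUB_CMDLINE_LINUX_DEFAULT differ per machine, so append rather than overwrite:

# /etc/default/grub -- add the systemd switch that disables the unified (v2)
# hierarchy so the cgroup v1 "devices" controller is available again.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"

# Regenerate the GRUB configuration and reboot for the change to take effect.
$ sudo update-grub
$ sudo reboot

As the comment notes, this is only relevant for microk8s releases prior to 1.22; if upgrading is an option, refreshing to a newer channel (for example sudo snap refresh microk8s --channel=1.22/stable) avoids pinning the whole host to cgroup v1.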

0 reactions
stale[bot] commented on Nov 22, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Read more comments on GitHub.

Top Results From Across the Web

  • GPU addon stops working after upgrading to Ubuntu 21.10 ... (GitHub issue #2662, opened by khteh on Oct 16)
  • after Ubuntu 21.10 upgrade: "cannot attach cgroup program ..."
  • Impish Indri Release Notes (Ubuntu Discourse)
  • Nvidia drivers are not working properly after upgrading to ...
  • Kernel 5.13 broke my Ubuntu 21.10 on RPI4 8Gb
