Can't enable gpu on Nvidia DGX A100
This is a follow-up to the issue in https://github.com/ubuntu/microk8s/issues/2115.
As described in that issue, the 1.21/beta channel of microk8s seems to handle enabling the GPU better. However, the instructions that work on Ubuntu 20.04 on a g3.4xlarge AWS instance don't work on an Nvidia DGX A100 machine.
sudo snap install microk8s --channel=1.21/beta --classic
microk8s enable gpu
I get the following pod stuck in Init:CrashLoopBackOff:
ubuntu@blanka:~$ microk8s kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-m76km 1/1 Running 0 16h
kube-system coredns-86f78bb79c-bhl86 1/1 Running 0 16h
kube-system calico-kube-controllers-847c8c99d-fc48p 1/1 Running 0 16h
default gpu-operator-65d474cc8-g8gdp 1/1 Running 0 16h
default gpu-operator-node-feature-discovery-worker-777t6 1/1 Running 0 15h
default gpu-operator-node-feature-discovery-master-dcf999dc8-n5fk2 1/1 Running 0 15h
gpu-operator-resources nvidia-driver-daemonset-ndlds 1/1 Running 0 15h
gpu-operator-resources nvidia-container-toolkit-daemonset-xwlbn 1/1 Running 0 15h
gpu-operator-resources nvidia-device-plugin-daemonset-lx5j4 0/1 Init:CrashLoopBackOff 186 15h
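As an aside, a quick way to surface only the unhealthy pods from output like the above is to filter on the STATUS column; a sketch, demonstrated here on a saved capture so it runs anywhere (on the affected node you would pipe live `microk8s kubectl get pods -A` output through the same awk filter):

```shell
# Keep only data rows whose STATUS column (field 4) is not Running/Completed.
# /tmp/pods.txt is a trimmed copy of the capture above.
cat > /tmp/pods.txt <<'EOF'
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-m76km 1/1 Running 0 16h
gpu-operator-resources nvidia-device-plugin-daemonset-lx5j4 0/1 Init:CrashLoopBackOff 186 15h
EOF
awk 'NR > 1 && $4 != "Running" && $4 != "Completed"' /tmp/pods.txt
# -> prints only the nvidia-device-plugin-daemonset row
```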
I haven’t been able to find useful information yet. Here are kubectl describe and kubectl logs for the pod (the latter returns no logs):
ubuntu@blanka:~$ microk8s kubectl describe pod nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Name: nvidia-device-plugin-daemonset-lx5j4
Namespace: gpu-operator-resources
Priority: 0
Node: blanka/10.229.66.23
Start Time: Mon, 22 Mar 2021 21:16:59 +0000
Labels: app=nvidia-device-plugin-daemonset
controller-revision-hash=b479cc95
pod-template-generation=1
Annotations: cni.projectcalico.org/podIP: 10.1.234.10/32
cni.projectcalico.org/podIPs: 10.1.234.10/32
scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP: 10.1.234.10
IPs:
IP: 10.1.234.10
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
toolkit-validation:
Container ID: containerd://580626297564adebebda2a69cc4172fcff6edab3e734ca5ac2134c48798bc88b
Image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
Image ID: nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
/tmp/vectorAdd
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 23 Mar 2021 12:07:43 +0000
Finished: Tue, 23 Mar 2021 12:07:43 +0000
Ready: False
Restart Count: 179
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Containers:
nvidia-device-plugin-ctr:
Container ID:
Image: nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
Image ID:
Port: <none>
Host Port: <none>
Args:
--mig-strategy=single
--pass-device-specs=true
--fail-on-init-error=true
--device-list-strategy=envvar
--nvidia-driver-root=/run/nvidia/driver
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
kube-api-access-h7b7h:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.present=true
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 34m (x173 over 14h) kubelet Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
Warning BackOff 4m6s (x4093 over 14h) kubelet Back-off restarting failed container
ubuntu@blanka:~$ microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Error from server (BadRequest): container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-lx5j4" is waiting to start: PodInitializing
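Worth noting: kubectl logs without a container name targets the main container, which is still in PodInitializing; the crashing container here is the init container toolkit-validation shown in the describe output. Selecting it with -c should yield the actual failure. A sketch, guarded so it degrades gracefully off-cluster (pod, namespace, and container names are taken from the report above):

```shell
# Fetch logs from the failing *init* container by name with -c.
if command -v microk8s >/dev/null 2>&1; then
  microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 \
    -n gpu-operator-resources -c toolkit-validation
else
  echo "microk8s not available on this machine; run on the affected node"
fi
```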
There are no nvidia drivers or cuda packages installed on the machine and never were (fresh MAAS deployment):
ubuntu@ip-172-31-14-39:~$ dpkg -l | grep -i -e nvidia -e cuda
ubuntu@ip-172-31-14-39:~$
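dpkg only covers installed packages; it may also be worth confirming that no nvidia or nouveau kernel modules are loaded, since a stray nouveau module claiming the GPU is a common reason the operator's driver container fails. A minimal sketch (Linux only):

```shell
# List any nvidia/nouveau kernel modules currently loaded.
if [ -r /proc/modules ]; then
  grep -i -e nvidia -e nouveau /proc/modules \
    || echo "no nvidia/nouveau modules loaded"
fi
```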
Issue Analytics
- Created 2 years ago
- Comments: 5
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I found a recipe that works for me. I’m using microk8s on Ubuntu 20.04; here’s what I had to do to make this work on both an Nvidia DGX A100 and a ProLiant DL380 Gen10 machine with a T4 GPU.

- Start from a machine with no nvidia packages installed (apt purge the nvidia packages if you can’t).
- Blacklist the nouveau driver: add modprobe.blacklist=nouveau nouveau.modeset=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, run sudo update-grub, and reboot.
- Install nvidia-fabricmanager-460 from the cuda repos (you won’t be able to enable the systemd service until the K8s GPU operator has loaded the drivers).
- microk8s enable dns (tip: make sure your DNS is working by launching a test pod and resolving internal and external hostnames)
- microk8s enable gpu
- Once the GPU operator has loaded the drivers, enable fabric manager (sudo systemctl --now enable nvidia-fabricmanager)
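The nouveau-blacklist step above can be sketched as a one-line edit. Demonstrated on a throwaway copy here; on the real machine you would apply the same sed line to /etc/default/grub, then run sudo update-grub and reboot (assumes GNU sed, as on Ubuntu):

```shell
# Append the blacklist parameters inside GRUB_CMDLINE_LINUX_DEFAULT.
# /tmp/grub.demo stands in for /etc/default/grub.
printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n' > /tmp/grub.demo
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 modprobe.blacklist=nouveau nouveau.modeset=0"/' /tmp/grub.demo
cat /tmp/grub.demo
# -> GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"
```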
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.