
Can't enable gpu on Nvidia DGX A100


This is a follow-up to the issue in https://github.com/ubuntu/microk8s/issues/2115

As described in that issue, the 1.21/beta channel of microk8s seems to work better for enabling GPU support. However, the instructions that work on Ubuntu 20.04 on an AWS g3.4xlarge instance don't work on an Nvidia DGX A100 machine.

sudo snap install microk8s --channel=1.21/beta --classic
microk8s enable gpu

I get the following pod in Init:CrashLoopBackOff:

ubuntu@blanka:~$ microk8s kubectl get pods -A
NAMESPACE                NAME                                                         READY   STATUS                  RESTARTS   AGE
kube-system              calico-node-m76km                                            1/1     Running                 0          16h
kube-system              coredns-86f78bb79c-bhl86                                     1/1     Running                 0          16h
kube-system              calico-kube-controllers-847c8c99d-fc48p                      1/1     Running                 0          16h
default                  gpu-operator-65d474cc8-g8gdp                                 1/1     Running                 0          16h
default                  gpu-operator-node-feature-discovery-worker-777t6             1/1     Running                 0          15h
default                  gpu-operator-node-feature-discovery-master-dcf999dc8-n5fk2   1/1     Running                 0          15h
gpu-operator-resources   nvidia-driver-daemonset-ndlds                                1/1     Running                 0          15h
gpu-operator-resources   nvidia-container-toolkit-daemonset-xwlbn                     1/1     Running                 0          15h
gpu-operator-resources   nvidia-device-plugin-daemonset-lx5j4                         0/1     Init:CrashLoopBackOff   186        15h

I haven't been able to find useful information yet. Here are the outputs of kubectl describe and kubectl logs (the logs are empty):

ubuntu@blanka:~$ microk8s kubectl describe pod nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Name:         nvidia-device-plugin-daemonset-lx5j4
Namespace:    gpu-operator-resources
Priority:     0
Node:         blanka/10.229.66.23
Start Time:   Mon, 22 Mar 2021 21:16:59 +0000
Labels:       app=nvidia-device-plugin-daemonset
              controller-revision-hash=b479cc95
              pod-template-generation=1
Annotations:  cni.projectcalico.org/podIP: 10.1.234.10/32
              cni.projectcalico.org/podIPs: 10.1.234.10/32
              scheduler.alpha.kubernetes.io/critical-pod: 
Status:       Pending
IP:           10.1.234.10
IPs:
  IP:           10.1.234.10
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Init Containers:
  toolkit-validation:
    Container ID:  containerd://580626297564adebebda2a69cc4172fcff6edab3e734ca5ac2134c48798bc88b
    Image:         nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Image ID:      nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      /tmp/vectorAdd
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 23 Mar 2021 12:07:43 +0000
      Finished:     Tue, 23 Mar 2021 12:07:43 +0000
    Ready:          False
    Restart Count:  179
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Containers:
  nvidia-device-plugin-ctr:
    Container ID:  
    Image:         nvcr.io/nvidia/k8s-device-plugin:v0.8.2-ubi8
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --mig-strategy=single
      --pass-device-specs=true
      --fail-on-init-error=true
      --device-list-strategy=envvar
      --nvidia-driver-root=/run/nvidia/driver
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  all
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h7b7h (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-h7b7h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   34m (x173 over 14h)    kubelet  Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Warning  BackOff  4m6s (x4093 over 14h)  kubelet  Back-off restarting failed container

ubuntu@blanka:~$ microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources
Error from server (BadRequest): container "nvidia-device-plugin-ctr" in pod "nvidia-device-plugin-daemonset-lx5j4" is waiting to start: PodInitializing
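
Since the main container never starts, the empty logs above are expected; the failure is in the toolkit-validation init container. A sketch for pulling its output directly (same pod and namespace as above, container name taken from the describe output):

# Logs from the failing init container (it runs the /tmp/vectorAdd validation)
microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources -c toolkit-validation

# Logs from the previous failed attempt, if the current one hasn't written anything yet
microk8s kubectl logs nvidia-device-plugin-daemonset-lx5j4 -n gpu-operator-resources -c toolkit-validation --previous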

No nvidia drivers or cuda packages are installed on the machine, and none ever have been (fresh MAAS deployment):

ubuntu@ip-172-31-14-39:~$ dpkg -l | grep -i -e nvidia -e cuda
ubuntu@ip-172-31-14-39:~$ 
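
To double-check the clean state, a couple of quick host-side checks (a sketch; these go beyond the dpkg check above):

# No nvidia or nouveau kernel modules should be loaded
lsmod | grep -i -e nvidia -e nouveau

# nvidia-smi should not exist yet; the driver is expected to come from the
# nvidia-driver-daemonset container
which nvidia-smi || echo "nvidia-smi not installed"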


Top GitHub Comments

1 reaction
davecore82 commented on Mar 23, 2021

I found a recipe that works for me. I'm using microk8s on Ubuntu 20.04; here's what I had to do to make this work on both an Nvidia DGX A100 and a ProLiant DL380 Gen10 machine with a T4 GPU (a consolidated command sketch follows the list):

  • Make sure to start on a fresh Ubuntu 20.04 with no nvidia drivers (or apt purge the nvidia packages if you can’t)
  • blacklist the nouveau driver (add modprobe.blacklist=nouveau nouveau.modeset=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run sudo update-grub and reboot)
  • [for the A100 only] Install the package nvidia-fabricmanager-460 from the cuda repos (you won’t be able to enable the systemd service until the K8s GPU operator has loaded the drivers)
  • Install microk8s with the 1.21/beta channel (I have version v1.21.0-beta.1)
  • microk8s enable dns (tip: make sure your DNS is working by launching a test pod and resolving internal and external hostnames)
  • microk8s enable gpu
  • [for the A100 only] enable nvidia-fabricmanager (sudo systemctl --now enable nvidia-fabricmanager)
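
Put together as a single command sketch (assumptions: an A100 system, the 460-series fabric manager package, and an already-configured NVIDIA cuda apt repository; adjust package versions to match your driver branch):

# 1. Blacklist nouveau: in /etc/default/grub, append
#    "modprobe.blacklist=nouveau nouveau.modeset=0" to GRUB_CMDLINE_LINUX_DEFAULT, then:
sudo update-grub
sudo reboot

# 2. [A100 only] Install fabric manager from the cuda repos; don't enable the
#    systemd service yet, the driver isn't loaded until the GPU operator runs
sudo apt-get install -y nvidia-fabricmanager-460

# 3. Install microk8s from the 1.21/beta channel, then enable dns and gpu
sudo snap install microk8s --channel=1.21/beta --classic
microk8s enable dns
microk8s enable gpu

# 4. [A100 only] Once the nvidia-driver-daemonset is running, enable fabric manager
sudo systemctl --now enable nvidia-fabricmanager

# 5. Check that the device plugin pod gets past its init step
microk8s kubectl get pods -n gpu-operator-resources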

0 reactions
stale[bot] commented on Nov 22, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
