NVIDIA drivers not installing on Azure cloud runner

Hi everybody, I am trying to use cml-runner on GitLab to deploy a GPU machine on which to run training. The deployment works great, but the Docker container that then runs the training doesn't seem to find any NVIDIA drivers: I can't run 'nvidia-smi'.

My .gitlab-ci.yml looks like this (simplified):

stages:
  - deploy
  - train

deploy:
  stage: deploy
  when: always
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    - cml-runner
      --cloud azure
      --cloud-region eu-west
      --cloud-type Standard_NC4as_T4_v3
      --cloud-hdd-size 128
      --cloud-gpu v100
      --labels=cml-runner-gpu

train:
  stage: train
  when: on_success
  image: dvcorg/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu

  script:
    - nvidia-smi

The examples don't mention that I need to install the drivers myself on the deployed machine; it looks like it should work out of the box. Am I overlooking something? Is that only the case for AWS? Do I need to pass a script that installs the drivers through --cloud-startup-script?

Cheers
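
To narrow down where the driver goes missing, a check like the following could run as the first step of the train job. This is only a sketch: check_gpu_driver is a hypothetical helper name, not part of CML, and the messages and return codes are illustrative assumptions.

```shell
# Hypothetical helper (not part of CML): report whether an NVIDIA driver
# is reachable from the current environment.
check_gpu_driver() {
  # Case 1: the nvidia-smi binary is not on PATH at all.
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found on PATH"
    return 1
  fi
  # Case 2: the binary exists but exits non-zero, e.g. no driver is loaded.
  if ! nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi found but it could not query a driver"
    return 2
  fi
  echo "NVIDIA driver detected"
}
```

Calling `check_gpu_driver || exit 1` before the training command would make the job fail early with a readable message instead of failing somewhere inside the training code.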

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

MaxHuerlimann commented, Jun 15, 2021

Hi, the issue still persists. It does work, though, if I pass a startup script that installs the drivers, like this:

stages:
  - deploy
  - train

deploy:
  stage: deploy
  when: always
  image: dvcorg/cml:0-dvc2-base1-gpu
  script:
    - script=$(echo 'sudo apt-get update && sudo apt-get upgrade && sudo apt-get install -y nvidia-driver-460' | base64 --wrap 0)
    - cml-runner
      --cloud azure
      --cloud-region eu-west
      --cloud-type Standard_NC4as_T4_v3
      --cloud-hdd-size 128
      --cloud-startup-script $script
      --labels=cml-runner-gpu

train:
  stage: train
  when: on_success
  image: dvcorg/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu

  script:
    - nvidia-smi

I even manually ssh’d to the created machine and could confirm that no NVIDIA drivers were installed if I didn’t pass the above script.
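
The encoding step in the deploy job can be tried locally before it goes into the pipeline. The sketch below builds the same driver-install command, encodes it with GNU coreutils base64 as in the job above, and checks that it decodes back unchanged. The -y flag on apt-get upgrade is an added assumption so that nothing waits for interactive input; the package name nvidia-driver-460 is the one used in the comment above and may need adjusting for other images.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Commands to run on the freshly provisioned VM.
# Assumption: -y is added to "upgrade" so apt never prompts for confirmation.
startup_script='sudo apt-get update &&
sudo apt-get upgrade -y &&
sudo apt-get install -y nvidia-driver-460'

# Encode as a single unwrapped base64 line (GNU coreutils syntax),
# which is the form the deploy job passes to --cloud-startup-script.
encoded=$(printf '%s' "$startup_script" | base64 --wrap 0)

# Round-trip check: decoding must return exactly what was encoded.
decoded=$(printf '%s' "$encoded" | base64 --decode)
if [ "$decoded" = "$startup_script" ]; then
  echo "startup script encodes cleanly (${#encoded} base64 characters)"
else
  echo "base64 round trip failed" >&2
  exit 1
fi
```

Quoting the variable when passing it on, as in `--cloud-startup-script "$encoded"`, avoids accidental word splitting if the value is ever produced without `--wrap 0`.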

MaxHuerlimann commented, Jun 17, 2021

That’s great! Thanks for the quick fix 😃
