CML pipelines are failing since yesterday due to `nvml error: driver not loaded: unknown`

Our pipelines that rely on the dvcorg/cml image started failing in the last 48 hours. It has happened in two unrelated branches, and we have not changed cml.yaml between a passing pipeline and a failing one. Any help would be greatly appreciated!

cml yaml:

name: train and evaluate rasa model

on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1

      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
          --cloud aws \
          --cloud-region eu-west \
          --cloud-type=c5a.4xlarge \
          --cloud-spot true \
          --labels=cml-runner,voice-control,oms-rasa-2 \
          --idle-timeout 60
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: docker://dvcorg/cml

    steps:
    - uses: actions/checkout@v2
      with: 
        ref: ${{ github.event.pull_request.head.sha }}

    - uses: actions/setup-python@v2
      with:
        python-version: '3.8.5'
    - name: Install dependencies
      run: |
        apt-get update -y
        apt-get install make python3-pip virtualenv curl
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: eu-west-1
      run: |
        python --version
        make virtualenv
        dvc repro
        echo "## Metrics" > report.md
        git fetch --prune
        dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
        dvc metrics diff main --show-md | grep "weighted" | sort >> report.md
        sed "s/results\///g" -i report.md
        cml-send-comment report.md
        dvc push
        
    - uses: EndBug/add-and-commit@v7
      if: ${{ github.ref != 'refs/heads/main' && github.ref != 'refs/heads/rasax/prod' }}
      with:
         add: 'dvc.lock --force'
         message: 'chg: dvc repro'

log:

2021-06-29T13:50:10.4642367Z Found online and idle self-hosted runner(s) in the current repository that matches the required labels: 'self-hosted , cml-runner'
2021-06-29T13:50:10.4642468Z Waiting for a self-hosted runner to pick up this job...
2021-06-29T13:50:33.0597577Z Current runner version: '2.278.0'
2021-06-29T13:50:33.0600923Z Runner name: 'cml-6durntj35n'
2021-06-29T13:50:33.0601602Z Runner group name: 'Default'
2021-06-29T13:50:33.0602645Z Machine name: <IP>
2021-06-29T13:50:33.0606044Z ##[group]GITHUB_TOKEN Permissions
2021-06-29T13:50:33.0607140Z Actions: write
2021-06-29T13:50:33.0607662Z Checks: write
2021-06-29T13:50:33.0608173Z Contents: write
2021-06-29T13:50:33.0608722Z Deployments: write
2021-06-29T13:50:33.0609303Z Discussions: write
2021-06-29T13:50:33.0609832Z Issues: write
2021-06-29T13:50:33.0610320Z Metadata: read
2021-06-29T13:50:33.0610849Z Packages: write
2021-06-29T13:50:33.0611423Z PullRequests: write
2021-06-29T13:50:33.0612071Z RepositoryProjects: write
2021-06-29T13:50:33.0612692Z SecurityEvents: write
2021-06-29T13:50:33.0613243Z Statuses: write
2021-06-29T13:50:33.0613842Z ##[endgroup]
2021-06-29T13:50:33.0616211Z Prepare workflow directory
2021-06-29T13:50:33.2592890Z Prepare all required actions
2021-06-29T13:50:33.2602665Z Getting action download info
2021-06-29T13:50:33.7054655Z Download action repository 'actions/checkout@v2'
2021-06-29T13:50:34.7494321Z Download action repository 'actions/setup-python@v2'
2021-06-29T13:50:35.2281198Z Download action repository 'EndBug/add-and-commit@v7'
2021-06-29T13:50:35.9396893Z ##[group]Checking docker version
2021-06-29T13:50:35.9402035Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}'
2021-06-29T13:50:35.9982563Z '1.41'
2021-06-29T13:50:36.0007485Z Docker daemon API version: '1.41'
2021-06-29T13:50:36.0008239Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}'
2021-06-29T13:50:36.0356821Z '1.41'
2021-06-29T13:50:36.0378773Z Docker client API version: '1.41'
2021-06-29T13:50:36.0384407Z ##[endgroup]
2021-06-29T13:50:36.0385071Z ##[group]Clean up resources from previous jobs
2021-06-29T13:50:36.0388671Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=5b8d47"
2021-06-29T13:50:36.0682101Z ##[command]/usr/bin/docker network prune --force --filter "label=5b8d47"
2021-06-29T13:50:36.0970266Z ##[endgroup]
2021-06-29T13:50:36.0970763Z ##[group]Create local container network
2021-06-29T13:50:36.0975563Z ##[command]/usr/bin/docker network create --label 5b8d47 github_network_10190d4fbfea416f8ed6edbf70ead6e2
2021-06-29T13:50:36.1517298Z e415a96aaaa077510f29866e9125f315d2a2625f488e734dd6ac1cf2ff7c7844
2021-06-29T13:50:36.1545099Z ##[endgroup]
2021-06-29T13:50:36.1550518Z ##[group]Starting job container
2021-06-29T13:50:36.1555348Z ##[command]/usr/bin/docker pull dvcorg/cml
2021-06-29T13:50:36.1823653Z Using default tag: latest
2021-06-29T13:50:37.4601146Z latest: Pulling from dvcorg/cml
2021-06-29T13:50:37.4608556Z 6e0aa5e7af40: Pulling fs layer
2021-06-29T13:50:37.4611124Z d47239a868b3: Pulling fs layer
2021-06-29T13:50:37.4612690Z 49cbb10cca85: Pulling fs layer
2021-06-29T13:50:37.4614549Z 4450dd082e0f: Pulling fs layer
2021-06-29T13:50:37.4616367Z b4bc5dc4c4f3: Pulling fs layer
2021-06-29T13:50:37.4618852Z 5353957e2ca6: Pulling fs layer
2021-06-29T13:50:37.4621313Z f91e05a16062: Pulling fs layer
2021-06-29T13:50:37.4622639Z aaf867d3c0de: Pulling fs layer
2021-06-29T13:50:37.4623317Z c08f0dda78de: Pulling fs layer
2021-06-29T13:50:37.4623879Z b8583ef8d926: Pulling fs layer
2021-06-29T13:50:37.4624460Z e54aabf399ec: Pulling fs layer
2021-06-29T13:50:37.4625078Z 31c8c8564309: Pulling fs layer
2021-06-29T13:50:37.4625581Z 0f13ac379859: Pulling fs layer
2021-06-29T13:50:37.4626090Z d06a8d5e22bf: Pulling fs layer
2021-06-29T13:50:37.4626623Z 40eef28bd265: Pulling fs layer
2021-06-29T13:50:37.4627136Z 38c79672cf4c: Pulling fs layer
2021-06-29T13:50:37.4627664Z c9aa58265f49: Pulling fs layer
2021-06-29T13:50:37.4628157Z f337545810eb: Pulling fs layer
2021-06-29T13:50:37.4629025Z 48307dacf7eb: Pulling fs layer
2021-06-29T13:50:37.4629527Z aaf867d3c0de: Waiting
2021-06-29T13:50:37.4629976Z 4450dd082e0f: Waiting
2021-06-29T13:50:37.4630441Z 31c8c8564309: Waiting
2021-06-29T13:50:37.4630924Z b4bc5dc4c4f3: Waiting
2021-06-29T13:50:37.4631410Z 0f13ac379859: Waiting
2021-06-29T13:50:37.4632045Z c08f0dda78de: Waiting
2021-06-29T13:50:37.4632477Z b8583ef8d926: Waiting
2021-06-29T13:50:37.4632901Z d06a8d5e22bf: Waiting
2021-06-29T13:50:37.4633365Z e54aabf399ec: Waiting
2021-06-29T13:50:37.4633784Z 40eef28bd265: Waiting
2021-06-29T13:50:37.4634193Z 38c79672cf4c: Waiting
2021-06-29T13:50:37.4634583Z f337545810eb: Waiting
2021-06-29T13:50:37.4634976Z f91e05a16062: Waiting
2021-06-29T13:50:37.4635986Z 48307dacf7eb: Waiting
2021-06-29T13:50:37.7598229Z d47239a868b3: Verifying Checksum
2021-06-29T13:50:37.7598810Z d47239a868b3: Download complete
2021-06-29T13:50:37.7728047Z 49cbb10cca85: Download complete
2021-06-29T13:50:37.9112011Z 6e0aa5e7af40: Verifying Checksum
2021-06-29T13:50:37.9112685Z 6e0aa5e7af40: Download complete
2021-06-29T13:50:38.1247913Z b4bc5dc4c4f3: Verifying Checksum
2021-06-29T13:50:38.1249804Z b4bc5dc4c4f3: Download complete
2021-06-29T13:50:38.1356853Z 4450dd082e0f: Verifying Checksum
2021-06-29T13:50:38.1357433Z 4450dd082e0f: Download complete
2021-06-29T13:50:38.2198289Z 5353957e2ca6: Verifying Checksum
2021-06-29T13:50:38.2200496Z 5353957e2ca6: Download complete
2021-06-29T13:50:38.4378184Z f91e05a16062: Verifying Checksum
2021-06-29T13:50:38.4378728Z f91e05a16062: Download complete
2021-06-29T13:50:38.4524171Z aaf867d3c0de: Verifying Checksum
2021-06-29T13:50:38.4525131Z aaf867d3c0de: Download complete
2021-06-29T13:50:39.0487640Z c08f0dda78de: Verifying Checksum
2021-06-29T13:50:39.0488454Z c08f0dda78de: Download complete
2021-06-29T13:50:39.0820413Z e54aabf399ec: Verifying Checksum
2021-06-29T13:50:39.0821693Z e54aabf399ec: Download complete
2021-06-29T13:50:39.4000505Z 6e0aa5e7af40: Pull complete
2021-06-29T13:50:39.4620641Z d47239a868b3: Pull complete
2021-06-29T13:50:39.5959779Z 49cbb10cca85: Pull complete
2021-06-29T13:50:40.1295356Z 4450dd082e0f: Pull complete
2021-06-29T13:50:40.1541323Z 0f13ac379859: Verifying Checksum
2021-06-29T13:50:40.1543102Z 0f13ac379859: Download complete
2021-06-29T13:50:40.4801541Z b8583ef8d926: Verifying Checksum
2021-06-29T13:50:40.4802452Z b8583ef8d926: Download complete
2021-06-29T13:50:40.5676446Z b4bc5dc4c4f3: Pull complete
2021-06-29T13:50:40.6766344Z 5353957e2ca6: Pull complete
2021-06-29T13:50:40.7359703Z f91e05a16062: Pull complete
2021-06-29T13:50:40.8835826Z aaf867d3c0de: Pull complete
2021-06-29T13:50:40.9633807Z 40eef28bd265: Verifying Checksum
2021-06-29T13:50:40.9634638Z 40eef28bd265: Download complete
2021-06-29T13:50:41.2653364Z 38c79672cf4c: Verifying Checksum
2021-06-29T13:50:41.2653929Z 38c79672cf4c: Download complete
2021-06-29T13:50:41.6396561Z c9aa58265f49: Verifying Checksum
2021-06-29T13:50:41.6397431Z c9aa58265f49: Download complete
2021-06-29T13:50:42.0421498Z f337545810eb: Verifying Checksum
2021-06-29T13:50:42.0422042Z f337545810eb: Download complete
2021-06-29T13:50:42.2896654Z d06a8d5e22bf: Verifying Checksum
2021-06-29T13:50:42.2897298Z d06a8d5e22bf: Download complete
2021-06-29T13:50:43.0043357Z 48307dacf7eb: Verifying Checksum
2021-06-29T13:50:45.1234288Z 31c8c8564309: Verifying Checksum
2021-06-29T13:50:45.1235101Z 31c8c8564309: Download complete
2021-06-29T13:50:45.5846392Z c08f0dda78de: Pull complete
2021-06-29T13:50:56.0230226Z b8583ef8d926: Pull complete
2021-06-29T13:50:58.5169629Z e54aabf399ec: Pull complete
2021-06-29T13:51:21.3108610Z 31c8c8564309: Pull complete
2021-06-29T13:51:26.0880182Z 0f13ac379859: Pull complete
2021-06-29T13:51:35.2518615Z d06a8d5e22bf: Pull complete
2021-06-29T13:51:38.0047445Z 40eef28bd265: Pull complete
2021-06-29T13:51:39.5772909Z 38c79672cf4c: Pull complete
2021-06-29T13:51:41.1517455Z c9aa58265f49: Pull complete
2021-06-29T13:51:42.9430233Z f337545810eb: Pull complete
2021-06-29T13:51:48.7808466Z 48307dacf7eb: Pull complete
2021-06-29T13:51:49.7515557Z Digest: sha256:1d92361a7ead9f3895d7388743ba839ffa52dd99dc6f760d5a9f7bccfc95f754
2021-06-29T13:51:50.2130319Z Status: Downloaded newer image for dvcorg/cml:latest
2021-06-29T13:51:50.3332050Z docker.io/dvcorg/cml:latest
2021-06-29T13:51:50.3392036Z ##[command]/usr/bin/docker create --name eea93c7cf2ee422895de1ef5f0471cb6_dvcorgcml_68f7cd --label 5b8d47 --workdir /__w/oms-rasa-2/oms-rasa-2 --network github_network_10190d4fbfea416f8ed6edbf70ead6e2  -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work":"/__w" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/externals":"/__e":ro -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work/_temp":"/__w/_temp" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work/_actions":"/__w/_actions" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work/_tool":"/__w/_tool" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.FWEBP47lVs/.cml/cml-00jgckdzfn/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" dvcorg/cml "-f" "/dev/null"
2021-06-29T13:52:30.1756032Z 6bee632c4d05aa60425a277e8d3d8d251776c81d676eeb8866b2668427b57293
2021-06-29T13:52:30.1795921Z ##[command]/usr/bin/docker start 6bee632c4d05aa60425a277e8d3d8d251776c81d676eeb8866b2668427b57293
2021-06-29T13:52:31.3270142Z Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
2021-06-29T13:52:31.3272131Z Error: failed to start containers: 6bee632c4d05aa60425a277e8d3d8d251776c81d676eeb8866b2668427b57293
2021-06-29T13:52:31.3388737Z ##[error]Docker start fail with exit code 1
2021-06-29T13:52:31.3525156Z Stop and remove container: eea93c7cf2ee422895de1ef5f0471cb6_dvcorgcml_68f7cd
2021-06-29T13:52:31.3528317Z ##[command]/usr/bin/docker rm --force 6bee632c4d05aa60425a277e8d3d8d251776c81d676eeb8866b2668427b57293
2021-06-29T13:52:31.3854268Z 6bee632c4d05aa60425a277e8d3d8d251776c81d676eeb8866b2668427b57293
2021-06-29T13:52:31.3898547Z Remove container network: github_network_10190d4fbfea416f8ed6edbf70ead6e2
2021-06-29T13:52:31.3901267Z ##[command]/usr/bin/docker network rm github_network_10190d4fbfea416f8ed6edbf70ead6e2
2021-06-29T13:52:31.5259536Z github_network_10190d4fbfea416f8ed6edbf70ead6e2
2021-06-29T13:52:31.5306794Z Cleaning up orphan processes

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
0x2b3bfa0 commented, Jun 29, 2021

Feel free to reopen this issue if you keep experiencing this problem in any of your workflows.

1 reaction
0x2b3bfa0 commented, Jun 29, 2021

Awesome! The `latest` tag is stale; we usually recommend pinning CML to one of the major version sets specified in the documentation.
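
For illustration, pinning the image in the `model-training` job from the workflow in the question might look like the fragment below. This is only a sketch: the tag `0-dvc2-base1` is an assumed example of a version-set tag and is not confirmed anywhere in this thread, so check the CML documentation for the tags that are currently published before using it.

```yaml
jobs:
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    # Pin the image to an explicit version-set tag instead of relying on
    # the implicit `latest` tag that `docker://dvcorg/cml` resolves to.
    # NOTE: `0-dvc2-base1` is an assumed example tag; pick one listed in
    # the CML documentation.
    container: docker://dvcorg/cml:0-dvc2-base1
```

With a pinned tag, a newly published `latest` image cannot change what the job runs between two otherwise identical pipeline runs.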

Top Results From Across the Web

NVIDIA drivers or nvidia-docker issues #860
Coming from discord Error response from daemon: OCI runtime create failed: ... initialization error: nvml error: driver not loaded: unknown.

NVIDIA Docker - nvml error: driver not loaded
However, when I try checking that my nvidia-docker installation was successful, I get the following error: $ sudo docker run --gpus all --rm ......

How to resolve "Failed to initialize NVML: Driver/library ...
Solution 1: Drain and reboot the worker. Rebooting the node is the easiest way to fix the issue. Rebooting the node will make...

Nvml error: driver/library version mismatch - cuOpt
Hi, Getting docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: ...

How can I fix “Failed to initialize NVML: Driver/library version ...
The “Failed to initialize NVML: Driver/library version mismatch?” error generally means the CUDA Driver is still running an older release ...
