
cml-py3 doesn't see GPU resources

See original GitHub issue

When I try to run a training task on our own AWS infrastructure, the following error is raised in GitHub Actions:

...
cfed3d9d6c7f: Pull complete
b5f3fa781593: Pull complete
53448e1579d7: Pull complete
c17eb7b4b5ac: Pull complete
25af3821284d: Pull complete
ea9f7c675b08: Pull complete
6522e7c5ced1: Pull complete
5fb2b6b033bf: Pull complete
1d90b6421d53: Pull complete
5d8a82854f4e: Pull complete
6fa3b0a92e5c: Pull complete
Digest: sha256:2e99adfe066a4383e3d391e5d4f1fbebc37b2c3d8f33ab883e810b35dd771965
Status: Downloaded newer image for dvcorg/cml-py3:latest
dfae88d60614134c3aeb2dc9095356b8cd545e1ad521f7db6575b518fe3ad679
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
About to remove cml1596929922
WARNING: This action will delete both local reference and remote instance.
Successfully removed cml1596929922
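
The key line is `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]`, which typically means the Docker daemon that executes the `docker run --gpus all` call has no NVIDIA container runtime to hand the GPU request to. A quick smoke test against whichever daemon the client is currently pointed at is to run `nvidia-smi` inside a CUDA base image (the image tag below is only an example; any `nvidia/cuda` base tag behaves the same way):

# Prints the GPU table when the NVIDIA Container Toolkit is set up on the
# daemon's host; fails with the same "could not select device driver" error
# when it is not.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi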

The workflow looks like this:

name: train-model

on: [push]

jobs:
  deploy-cloud-runner:
    runs-on: [service-catalog, linux, x64]
    container: docker://dvcorg/cml

    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_EC2 }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_EC2 }}
        run: |
          echo "Deploying..."
          distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
          curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
          apt-get update && apt-get install -y nvidia-container-toolkit
          RUNNER_LABELS="cml,aws"
          RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}"
          MACHINE="cml$(date +%s)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-vpc-id vpc-xxxxxxxx \
            --amazonec2-region eu-west-1 \
            --amazonec2-zone "a" \
            --amazonec2-ssh-user ubuntu \
            --amazonec2-ami ami-089cc16f7f08c4457 \
            --amazonec2-root-size 10 \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
          docker-machine ssh $MACHINE "sudo mkdir -p \
            /docker_machine && \
          sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ \
            $MACHINE:/docker_machine && \
          docker run --name runner --gpus all -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e DOCKER_MACHINE=$MACHINE \
            -e repo_token=$repo_token \
            -e RUNNER_LABELS=$RUNNER_LABELS \
            -e RUNNER_REPO=$RUNNER_REPO \
            -e RUNNER_IDLE_TIMEOUT=120 \
            dvcorg/cml-py3:latest && \
          sleep 20 && echo "Deployed $MACHINE"
          ) || (docker-machine rm -y -f $MACHINE && exit 1)
  train:
# ....

We run the tests on a self-hosted runner.
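
One thing worth double-checking in this kind of setup (an observation about the configuration, not the fix reported in the comments below): once `eval "$(docker-machine env --shell sh $MACHINE)"` has run, the `docker run --gpus all ...` call is served by the Docker daemon on the EC2 instance, so the NVIDIA driver and the NVIDIA Container Toolkit have to be present on that AMI; the `apt-get install nvidia-container-toolkit` step earlier in the job only installs the toolkit inside the container the workflow itself runs in. A sketch of how the remote host could be inspected (standard docker-machine and shell commands, shown purely as an illustration):

# Driver check on the EC2 host: should print the nvidia-smi GPU table.
docker-machine ssh $MACHINE "nvidia-smi"

# Toolkit check on the EC2 host: the --gpus flag needs the toolkit's hook
# binary (named nvidia-container-toolkit or nvidia-container-runtime-hook,
# depending on the package version) to be installed where the daemon runs.
docker-machine ssh $MACHINE "which nvidia-container-toolkit nvidia-container-runtime-hook"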

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 21 (12 by maintainers)

Top GitHub Comments

1 reaction
pommedeterresautee commented, Aug 10, 2020

Thanks a lot @DavidGOrtega, the issue was indeed related to the token (its name no longer matched the one referenced in the workflow script). It’s fixed now and the workflow works like a charm.
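
For context on that fix: GitHub Actions injects a secret only when its name matches the `${{ secrets.<NAME> }}` expression exactly; a renamed or missing secret silently expands to an empty string instead of failing the step, which is easy to miss in the logs. A minimal sketch of the naming constraint as it applies to the workflow above:

# Workflow side (unchanged from the deploy job above):
env:
  repo_token: ${{ secrets.REPO_TOKEN }}                            # needs a secret named exactly REPO_TOKEN
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_EC2 }}          # ... named AWS_ACCESS_KEY_ID_EC2
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_EC2 }}  # ... named AWS_SECRET_ACCESS_KEY_EC2

# Repository side: the secrets must be defined on the repository's
# Settings → Secrets page with those exact names; renaming one there
# without updating the workflow breaks the deploy step.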

1 reaction
DavidGOrtega commented, Aug 10, 2020

@pommedeterresautee (love your nick) checking

Read more comments on GitHub >

