cml-py3 doesn't see GPU resources
When I try to run a train task on our own AWS infrastructure, the following error is raised in GitHub Actions:
...
cfed3d9d6c7f: Pull complete
b5f3fa781593: Pull complete
53448e1579d7: Pull complete
c17eb7b4b5ac: Pull complete
25af3821284d: Pull complete
ea9f7c675b08: Pull complete
6522e7c5ced1: Pull complete
5fb2b6b033bf: Pull complete
1d90b6421d53: Pull complete
5d8a82854f4e: Pull complete
6fa3b0a92e5c: Pull complete
Digest: sha256:2e99adfe066a4383e3d391e5d4f1fbebc37b2c3d8f33ab883e810b35dd771965
Status: Downloaded newer image for dvcorg/cml-py3:latest
dfae88d60614134c3aeb2dc9095356b8cd545e1ad521f7db6575b518fe3ad679
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
About to remove cml1596929922
WARNING: This action will delete both local reference and remote instance.
Successfully removed cml1596929922
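The `could not select device driver "" with capabilities: [[gpu]]` message comes from the Docker daemon that receives the `--gpus all` flag: it has no GPU device driver registered, which usually means the NVIDIA container toolkit (or the NVIDIA driver itself) is not set up on that host. A minimal diagnostic sketch, assuming shell access to the machine whose daemon actually runs the container; the `nvidia/cuda:11.0-base` tag is only an illustrative image, not something from the original thread:

# Run these on the host whose Docker daemon handles `docker run --gpus all`.
nvidia-smi                                      # the NVIDIA driver must work on the host
dpkg -l | grep nvidia-container-toolkit         # the toolkit must be installed on that host
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi   # end-to-end GPU visibility check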
The workflow looks like this:
name: train-model
on: [push]
jobs:
  deploy-cloud-runner:
    runs-on: [service-catalog, linux, x64]
    container: docker://dvcorg/cml
    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_EC2 }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_EC2 }}
        run: |
          echo "Deploying..."
          distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
          curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
          apt-get update && apt-get install -y nvidia-container-toolkit
          RUNNER_LABELS="cml,aws"
          RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}"
          MACHINE="cml$(date +%s)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-vpc-id vpc-xxxxxxxx \
            --amazonec2-region eu-west-1 \
            --amazonec2-zone "a" \
            --amazonec2-ssh-user ubuntu \
            --amazonec2-ami ami-089cc16f7f08c4457 \
            --amazonec2-root-size 10 \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
            docker-machine ssh $MACHINE "sudo mkdir -p \
              /docker_machine && \
              sudo chmod 777 /docker_machine" && \
            docker-machine scp -r -q ~/.docker/machine/ \
              $MACHINE:/docker_machine && \
            docker run --name runner --gpus all -d \
              -v /docker_machine/machine:/root/.docker/machine \
              -e DOCKER_MACHINE=$MACHINE \
              -e repo_token=$repo_token \
              -e RUNNER_LABELS=$RUNNER_LABELS \
              -e RUNNER_REPO=$RUNNER_REPO \
              -e RUNNER_IDLE_TIMEOUT=120 \
              dvcorg/cml-py3:latest && \
            sleep 20 && echo "Deployed $MACHINE"
          ) || (docker-machine rm -y -f $MACHINE && exit 1)
  train:
    # ....
We run the tests on a self-hosted runner.
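Note that in this workflow the `apt-get install -y nvidia-container-toolkit` step runs inside the `dvcorg/cml` job container on the self-hosted runner, while `eval "$(docker-machine env ...)"` points the Docker client at the daemon on the newly created EC2 instance, so that remote daemon is the one that has to resolve `--gpus all`. A hedged sketch of installing the toolkit on the remote machine instead, assuming an Ubuntu AMI with the NVIDIA driver already present (the package steps and the Docker restart are standard for Ubuntu but are assumptions here, not part of the original workflow):

# Sketch only: install the NVIDIA container toolkit on the docker-machine host,
# where `docker run --gpus all` is actually executed.
docker-machine ssh $MACHINE '
  distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
    | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
  sudo systemctl restart docker
'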
Top GitHub Comments
Thank you a lot @DavidGOrtega, the issue was indeed related to the token (its name no longer matched the one referenced in the workflow script). It’s fixed now and the workflow works like a charm.
@pommedeterresautee (love your nick) checking
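For anyone hitting the same symptom: the workflow above reads the token as `${{ secrets.REPO_TOKEN }}`, so the repository secret must be named exactly REPO_TOKEN. A quick way to check, assuming the GitHub CLI is installed and authenticated (the repository path below is a placeholder):

# List repository secrets and confirm one named REPO_TOKEN exists.
gh secret list --repo <owner>/<repo>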