cml-cloud-run with gpus
I'm testing out using an EC2 GPU with the cloud container `cml-gpu-py3-cloud-runner`. I wanted to make sure I'm on the right track:
```yaml
name: train-my-model
on: [push]
jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-gpu-cloud-runner
    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."
          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-region us-east-2 \
            --amazonec2-zone a \
            --amazonec2-vpc-id vpc-76f1f01e \
            --amazonec2-ssh-user ubuntu \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
            docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
            docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
            docker run --name runner -d \
              -v /docker_machine/machine:/root/.docker/machine \
              -e RUNNER_IDLE_TIMEOUT=120 \
              -e DOCKER_MACHINE=${MACHINE} \
              -e RUNNER_LABELS=cml \
              -e repo_token=$repo_token \
              -e NVIDIA_VISIBLE_DEVICES=all \
              -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
              dvcorg/cml-gpu-py3-cloud-runner && \
            sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml]
    steps:
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
        run: |
          nvidia-smi
```
This isn't working yet; it looks like there are issues getting the drivers set up on the self-hosted runner. I'm betting I have a flag wrong somewhere in the `deploy` job. I tried adding the flag `--gpus all` to `docker run`, but that didn't work. Any ideas?
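One thing worth noting: `--gpus all` only has an effect if the Docker host (here, the EC2 instance that docker-machine provisions) has an NVIDIA driver and the NVIDIA container toolkit installed; the stock Ubuntu AMI that docker-machine launches ships with neither. A minimal sanity check one could run on the provisioned instance (the script below is an illustrative sketch, not part of CML):

```shell
#!/bin/sh
# Sketch: check whether this host can expose a GPU to containers.
# Run on the provisioned EC2 instance, e.g. via `docker-machine ssh $MACHINE`.
# For `docker run --gpus all` to work, the host needs both the NVIDIA driver
# (providing nvidia-smi) and the nvidia-container-toolkit package installed.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  echo "GPU visible: yes"
else
  echo "GPU visible: no"
fi
```

If the driver is present, an end-to-end check is `docker run --gpus all <gpu-image> nvidia-smi`; if that fails with an unknown-flag or runtime error, the container toolkit is likely missing on the host.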
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
I did fix it, but to be honest I'm not sure what did the trick. For example, this .yaml with `git show origin/master` in the workflow is running fine: https://github.com/iterative/cml_cloud_case/blob/experiment/.github/workflows/cml.yaml

I did experiment with different repo token permissions (I am currently using one repo token throughout that has repository read/write and workflow privileges, but I'm not positive workflow privileges are really needed). But I'm not sure if that's what fixed it.
@andronovhopf feel free to open it again