
cml-cloud-run with gpus


I’m testing out using an EC2 GPU instance with the cloud container cml-gpu-py3-cloud-runner. I wanted to make sure I’m on the right track:

name: train-my-model

on: [push]

jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-gpu-cloud-runner

    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }} 
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."
          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
              --driver amazonec2 \
              --amazonec2-instance-type g3s.xlarge \
              --amazonec2-region us-east-2 \
              --amazonec2-zone a \
              --amazonec2-vpc-id vpc-76f1f01e \
              --amazonec2-ssh-user ubuntu \
              $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          ( 
          docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
          docker run --name runner -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e RUNNER_IDLE_TIMEOUT=120 \
            -e DOCKER_MACHINE=${MACHINE} \
            -e RUNNER_LABELS=cml \
            -e repo_token=$repo_token \
            -e NVIDIA_VISIBLE_DEVICES=all \
            -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
            dvcorg/cml-gpu-py3-cloud-runner && \
               sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml]
    
    steps:
      - uses: actions/checkout@v2

      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }} 
        run: |
          nvidia-smi

This isn’t working yet; there look to be issues getting the drivers set up on the self-hosted runner. I’m betting I have a flag wrong somewhere in the deploy job. I tried adding the `--gpus all` flag to `docker run`, but that didn’t work. Any ideas?
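For reference, the variant of the `docker run` command above that I’ve been experimenting with looks like this (an untested sketch — note that `--gpus all` requires Docker 19.03+ with the NVIDIA container toolkit installed on the EC2 host; on older nvidia-docker2 setups, `--runtime=nvidia` is the equivalent flag):

```shell
# Same runner launch as in the deploy step, with explicit GPU flags (sketch).
docker run --name runner -d \
  --gpus all \
  -v /docker_machine/machine:/root/.docker/machine \
  -e RUNNER_IDLE_TIMEOUT=120 \
  -e DOCKER_MACHINE=${MACHINE} \
  -e RUNNER_LABELS=cml \
  -e repo_token=$repo_token \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
  dvcorg/cml-gpu-py3-cloud-runner
```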

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction · elleobrien commented, Jun 30, 2020

I did fix it, but to be honest I’m not sure what did the trick. For example, this .yaml, which runs git show origin/master in the workflow, is running fine:

https://github.com/iterative/cml_cloud_case/blob/experiment/.github/workflows/cml.yaml

I did experiment with different repo token permissions (I am currently using one repo token throughout that has repository read/write and workflow privileges, though I’m not positive the workflow privileges are really needed), but I’m not sure if that’s what fixed it.
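For context, the relevant part of that workflow looks roughly like this (a sketch from memory of the linked cml.yaml, not a verbatim copy; the `fetch-depth: 0` checkout option is my assumption, so that `origin/master` actually exists in the runner’s local clone):

```yaml
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0   # assumption: fetch full history so origin/master is available

      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
        run: |
          git fetch --prune origin
          git show origin/master
```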

0 reactions · DavidGOrtega commented, Jul 21, 2020

@andronovhopf feel free to open it again

