cml-cloud-run with gpus
I'm testing out using an EC2 GPU with the cloud container `cml-gpu-py3-cloud-runner`. I wanted to make sure I'm on the right track:
```yaml
name: train-my-model
on: [push]
jobs:
  deploy-cloud-runner:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-gpu-cloud-runner
    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          echo "Deploying..."
          MACHINE="CML-$(openssl rand -hex 12)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-region us-east-2 \
            --amazonec2-zone a \
            --amazonec2-vpc-id vpc-76f1f01e \
            --amazonec2-ssh-user ubuntu \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
            docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && sudo chmod 777 /docker_machine" && \
            docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
            docker run --name runner -d \
              -v /docker_machine/machine:/root/.docker/machine \
              -e RUNNER_IDLE_TIMEOUT=120 \
              -e DOCKER_MACHINE=${MACHINE} \
              -e RUNNER_LABELS=cml \
              -e repo_token=$repo_token \
              -e NVIDIA_VISIBLE_DEVICES=all \
              -e RUNNER_REPO=https://github.com/andronovhopf/test_cloud \
              dvcorg/cml-gpu-py3-cloud-runner && \
            sleep 20 && echo "Deployed $MACHINE"
          ) || (echo y | docker-machine rm $MACHINE && exit 1)
  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml]
    steps:
      - uses: actions/checkout@v2
      - name: cml_run
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
        run: |
          nvidia-smi
```
This isn't working yet; it looks like there are issues getting the drivers set up on the self-hosted runner. I'm betting I have a flag wrong somewhere in the `deploy` job. I tried adding the flag `--gpus all` to `docker run`, but that didn't work. Any ideas?
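One thing worth noting: `--gpus all` only has an effect if the Docker host (here, the EC2 instance that docker-machine provisions) has an NVIDIA driver and the NVIDIA container toolkit installed; the stock Ubuntu AMI that docker-machine launches ships with neither. A minimal sanity check one could run on the provisioned instance (the script below is an illustrative sketch, not part of CML):

```shell
#!/bin/sh
# Sketch: check whether this host can expose a GPU to containers.
# Run on the provisioned EC2 instance, e.g. via `docker-machine ssh $MACHINE`.
# For `docker run --gpus all` to work, the host needs both the NVIDIA driver
# (providing nvidia-smi) and the nvidia-container-toolkit package installed.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  echo "GPU visible: yes"
else
  echo "GPU visible: no"
fi
```

If the driver is present, an end-to-end check is `docker run --gpus all <gpu-image> nvidia-smi`; if that fails with an unknown-flag or runtime error, the container toolkit is likely missing on the host.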
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
I did fix it, but to be honest I'm not sure what did the trick. For example, this .yaml with `git show origin/master` in the workflow is running fine: https://github.com/iterative/cml_cloud_case/blob/experiment/.github/workflows/cml.yaml

I did experiment with different repo token permissions (I am currently using one repo token throughout that has repository read/write and workflow privileges, but I'm not positive workflow privileges are really needed). But I'm not sure if that's what fixed it.
@andronovhopf feel free to open it again