CML seemingly fails to restart job after AWS Spot instances have been shut down
See original GitHub issue.

Hey everyone,

So I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS spot runners have been told to shut down. However, this doesn't seem to be happening for me.
A couple of details about our case:
- Our cloud is AWS
- We're (as far as I can tell) using the `latest` version of CML to deploy a bunch of runners as shown below.
```yaml
deploy_runners:
  name: Deploy Cloud Instances
  needs: [setup_config]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - uses: iterative/setup-cml@v1
      with:
        version: latest
    - name: "Deploy runner on EC2"
      shell: bash
      env:
        repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
        CASE_NAME: ${{ matrix.case_name }}
        N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
      run: |
        for (( i=1; i<=N_RUNNERS; i++ ))
        do
          echo "Deploying runner ${i}"
          cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
        done
        wait
        echo "Deployed ${N_RUNNERS} runners."
```
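For reference, the `run:` step above relies on the standard shell background-and-wait pattern, so all runners are deployed concurrently and the step only finishes once every deploy has returned. A minimal standalone illustration, with `sleep` standing in for the long-running `cml-runner` call:

```shell
#!/usr/bin/env bash
# Minimal illustration of the background-and-wait pattern from the workflow's
# run: step. `sleep` is a placeholder for the cml-runner invocation.
deploy_all() {
  local n_runners="$1"
  for (( i=1; i<=n_runners; i++ )); do
    echo "Deploying runner ${i}"
    sleep 0.1 &   # placeholder for: cml-runner --cloud aws ... &
  done
  wait            # block until every backgrounded deploy returns
  echo "Deployed ${n_runners} runners."
}

deploy_all 3
```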
- The job each runner runs does not use the CML images provided by Iterative
- The job that each runner runs has `continue-on-error` set to `false` (wondering whether that is interfering with CML?)
```yaml
run_optimisation:
  continue-on-error: false
  strategy:
    matrix: ${{ fromJson(needs.setup_config.outputs.json_string).matrix }}
    fail-fast: true
  runs-on: [self-hosted, "cml-runner"]
  container:
    image: python:3.8.10-slim
    volumes:
      - /dev/shm:/dev/shm
```
Issue Analytics
- Created: 2 years ago
- Comments: 15 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Awesome, thank you!
I ran another test and now I'm a little confused: I followed your instructions again and got a similar log (log.txt).
On the EC2 console, I can see that the instance in question has indeed been terminated.
HOWEVER, the spot request that created it still has a "fulfilled" status AND the GitHub Actions job is still running…
I hope this makes more sense to you than it does to me!
Update: After about 5 minutes, the spot request was marked as `terminated-by-user`, but the GitHub Actions job is still running… As far as I can tell, no new spot requests have been made.
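For anyone else debugging the same symptom, it can help to inspect the spot request states directly with the AWS CLI rather than the console (a sketch, assuming the CLI is configured with the same credentials as the workflow; the region and the helper function are illustrative, not part of CML):

```shell
#!/usr/bin/env bash
# Helper (illustrative): only "open" or "active" spot requests can still back a
# runner; "closed", "cancelled", or "failed" requests will not be replaced by AWS.
spot_request_live() {
  case "$1" in
    open|active) return 0 ;;
    *)           return 1 ;;
  esac
}

# List spot requests and their states in the region the runners were deployed to.
if command -v aws >/dev/null; then
  aws ec2 describe-spot-instance-requests \
    --region eu-west-2 \
    --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}' \
    --output table
fi
```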