CML seemingly fails to restart job after AWS Spot instances have been shut down

Hey everyone. A couple of days ago I noticed that CML now has new functionality that lets it restart workflows when one or more AWS spot runners have been told to shut down. However, this doesn’t seem to be happening for me.

A couple of details about our case:

  • Our cloud is AWS
  • We’re (as far as I can tell) using the latest version of CML to deploy a bunch of runners, as shown below.
  deploy_runners:
    name: Deploy Cloud Instances
    needs: [setup_config]

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
        with:
          version: latest

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
  • The job each runner runs does not use the CML images provided by iterative
  • The job that each runner runs has continue-on-error set to false (I’m wondering whether that interferes with CML).
  run_optimisation:
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

1 reaction
thatGreekGuy96 commented, Jul 20, 2021

Awesome, thank you!

1 reaction
thatGreekGuy96 commented, Jul 8, 2021

I ran another test and now I’m a little confused. I followed your instructions again and got a similar log (log.txt).

On the EC2 console, I can see that the instance in question has indeed been terminated.

HOWEVER, the spot request that created it still has a “fulfilled” status AND the GitHub Actions job is still running…

I hope this makes more sense to you than it does to me!

Update: After about 5 minutes, the spot request was marked as terminated-by-user, but the GitHub Actions job is still running… As far as I can tell, no new spot requests have been made.
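
For anyone retracing these checks from a terminal instead of the EC2 console, here is a minimal sketch using the AWS CLI. The function names spot_request_status and instance_state are hypothetical helpers, the default region is simply the one from the workflow above, and the sketch assumes the AWS CLI is installed with credentials configured.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of CML) for checking what the EC2 console shows.

# Print the ID, state, and status code of each Spot request that was
# fulfilled by the given instance ID.
spot_request_status() {
  local instance_id="$1"
  local region="${2:-eu-west-2}"  # assumption: region from the workflow above
  aws ec2 describe-spot-instance-requests \
    --region "$region" \
    --filters "Name=instance-id,Values=${instance_id}" \
    --query 'SpotInstanceRequests[].[SpotInstanceRequestId,State,Status.Code]' \
    --output text
}

# Print the lifecycle state (for example running or terminated) of an instance.
instance_state() {
  local instance_id="$1"
  local region="${2:-eu-west-2}"
  aws ec2 describe-instances \
    --region "$region" \
    --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].State.Name' \
    --output text
}
```

Calling spot_request_status with the terminated instance's ID, followed by instance_state with the same ID, should show whether the request and the instance agree, which is the mismatch described above. Neither call says anything about what the GitHub Actions job is doing.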
