
Instances intermittently fail to terminate

Recently, a couple of my instances have failed to terminate. The most recent case was with the --reuse flag set, after the runner had processed a series of 8 queued jobs.

The instance is sitting idle even though the 60-second idle timeout expired ten minutes ago, so I’ll need to terminate it manually from the command line.

In the most serious case, I had an instance run for two weeks without terminating. It took so long for us to notice because the instance name did not get set to cml-* as usual.
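
For anyone in the same situation, manual cleanup with the AWS CLI might look roughly like the sketch below. It is not part of the original report. It assumes the AWS CLI is installed with credentials configured; the function names, the region, and the instance ID in the usage note are placeholders. Listing every running instance with its launch time and Name tag should make a long-lived runner stand out even when its name was never set to cml-*.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the aws subcommands are standard AWS CLI.

# Print instance ID, launch time, and Name tag for every running instance in a region.
list_running_instances() {
  local region="$1"
  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,LaunchTime,Tags[?Key==`Name`]|[0].Value]' \
    --output text
}

# Terminate a single instance by ID.
terminate_instance() {
  local region="$1" instance_id="$2"
  aws ec2 terminate-instances --region "$region" --instance-ids "$instance_id"
}
```

Usage would be something like list_running_instances eu-west-1 followed by terminate_instance eu-west-1 i-0123456789abcdef0, where the ID is copied from the listing.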

Here’s the yml we are using:

name: train and evaluate rasa model

on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1

      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
          --cloud aws \
          --cloud-region eu-west \
          --cloud-type=c5a.4xlarge \
          --cloud-spot true \
          --labels=cml-runner,voice-control,oms-rasa-2 \
          --idle-timeout 60 \
          --reuse
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: docker://dvcorg/cml:0-dvc2-base1

    steps:
    - uses: actions/checkout@v2
      with: 
        ref: ${{ github.event.pull_request.head.sha }}

    - uses: actions/setup-python@v2
      with:
        python-version: '3.8.5'
    - name: Install dependencies
      run: |
        apt-get update -y
        apt-get install -y make python3-pip virtualenv curl
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: eu-west-1
      run: |
        python --version
        make virtualenv
        dvc repro
        echo "## Metrics" > report.md
        git fetch --prune
        dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
        dvc metrics diff main --show-md | grep -E "(intent|entity|action).*weighted" | sort >> report.md
        sed "s/results\///g" -i report.md
        cml-send-comment report.md
        dvc push

    - uses: actions/upload-artifact@v2
      with:
        name: gh-artifact-${{ github.event.pull_request.head.sha }}
        path: |
          report.md
          results
        retention-days: 30

    # Commit the updated dvc.lock unless the ref is main or rasax/prod
    - uses: EndBug/add-and-commit@v7
      if: ${{ github.ref != 'refs/heads/main' && github.ref != 'refs/heads/rasax/prod' }}
      with:
        add: 'dvc.lock --force'
        pull_strategy: 'NO-PULL'
        message: 'chg: dvc repro'

Issue Analytics

Created 2 years ago
Comments: 24 (12 by maintainers)

Top GitHub Comments

2 reactions

ivyleavedtoadflax commented, Jul 19, 2021

Amazing, great work @DavidGOrtega 👏

2 reactions

jamt9000 commented, Jul 14, 2021

I have observed this too, with instances started with --single staying up for days. As a workaround I am now using echo "/sbin/poweroff" | /usr/bin/at now + 60 min on startup to schedule a shutdown.
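
The workaround above could be wrapped in a small startup helper, sketched below. This is an illustration rather than the commenter's actual script: the function name and the default of 60 minutes are assumptions, and it presumes the at package is installed, atd is running, and the script runs as root. Whether a poweroff stops or terminates the instance depends on the instance's configured shutdown behavior.

```shell
#!/usr/bin/env bash
# Sketch only: schedule_poweroff is a hypothetical helper name.

# Queue a poweroff with at(1) a given number of minutes from now (default 60).
schedule_poweroff() {
  local minutes="${1:-60}"
  echo "/sbin/poweroff" | at now + "$minutes" minutes
}
```

Calling schedule_poweroff 60 from the instance's startup script gives the same hard upper bound on uptime as the one-liner in the comment.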