question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Instances intermittently fail to terminate

See original GitHub issue

I’ve had a couple of instances recently that have failed to terminate. In the most recent case this was with the --reuse flag set, having run a series of 8 queued jobs.

The instance is sitting idle, with a timeout of 60s having passed ten minutes ago. I’ll need to terminate the instance manually from the command line.

In the most serious case, I had an instance run for two weeks without terminating. It took so long for us to notice because the instance name did not get set to cml-* as usual.

Here’s the yml we are using:

name: train and evaluate rasa model

on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1

      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
          --cloud aws \
          --cloud-region eu-west \
          --cloud-type=c5a.4xlarge \
          --cloud-spot true \
          --labels=cml-runner,voice-control,oms-rasa-2 \
          --idle-timeout 60 \
          --reuse
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: docker://dvcorg/cml:0-dvc2-base1

    steps:
    - uses: actions/checkout@v2
      with: 
        ref: ${{ github.event.pull_request.head.sha }}

    - uses: actions/setup-python@v2
      with:
        python-version: '3.8.5'
    - name: Install dependencies
      run: |
        apt-get update -y
        apt-get install make python3-pip virtualenv curl
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: eu-west-1
      run: |
        python --version
        make virtualenv
        dvc repro
        echo "## Metrics" > report.md
        git fetch --prune
        dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
        dvc metrics diff main --show-md | grep -E "(intent|entity|action).*weighted" | sort >> report.md
        sed "s/results\///g" -i report.md
        cml-send-comment report.md
        dvc push

    - uses: actions/upload-artifact@v2
      with:
        name: gh-artifact-${{ github.event.pull_request.head.sha }}
        path: |
          report.md
          results
        retention-days: 30
        
    - uses: EndBug/add-and-commit@v7
      if: ${{ github.ref != 'refs/heads/main' }} && ${{ github.ref != 'refs/heads/rasax/prod' }}
      with:
         add: 'dvc.lock --force'
         pull_strategy: 'NO-PULL'
         message: 'chg: dvc repro'

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:24 (12 by maintainers)

github_iconTop GitHub Comments

2reactions
ivyleavedtoadflaxcommented, Jul 19, 2021

Amazing, great work @DavidGOrtega 👏

2reactions
jamt9000commented, Jul 14, 2021

I have observed this too, with instances started with --single staying up for days. As a workaround I am now using echo "/sbin/poweroff" | /usr/bin/at now + 60 min on startup to schedule a shutdown.

Screenshot 2021-07-14 at 12 31 50 copy

(I have also had the no-name issue happen once).

Read more comments on GitHub >

github_iconTop Results From Across the Web

Troubleshoot instance termination (shutting down)
Several issues can cause your instance to terminate immediately on start-up. See Instance terminates immediately for more information.
Read more >
Why did Auto Scaling Group terminate my healthy instance(s)?
Find more details in the AWS Knowledge Center: https://amzn.to/2wiCOTKManju, an AWS Cloud Support Engineer, explains why an Amazon EC2 Auto ...
Read more >
How do I delay Auto Scaling termination of unhealthy EC2 ...
AWS KC Videos: How do I troubleshoot why my Auto Scale instances fail during scale-out deployments? Amazon Web Services•1.7K views.
Read more >
Unexpected cluster termination - Databricks Knowledge Base
Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination.
Read more >
** Troubleshooting ** Intermittent 'Connection to server lost' or ...
Intermittently the end users will receive the error message. It can occur at any time, whilst performing any action, inside any menu item....
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found