Instances intermittently fail to terminate
I’ve had a couple of instances recently that have failed to terminate. In the most recent case this was with the --reuse flag set, having run a series of 8 queued jobs.
The instance is sitting idle, with a timeout of 60s having passed ten minutes ago. I’ll need to terminate the instance manually from the command line.
In the most serious case, I had an instance run for two weeks without terminating. It took so long for us to notice because the instance name did not get set to cml-* as usual.
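Since a runner that comes up without its cml-* name is easy to miss, one way to catch stragglers is to go by launch time rather than by name. The following is only a minimal sketch under stated assumptions: it needs GNU date, the instance IDs in the usage note are made up, and is_overdue and list_overdue are hypothetical helper names, not part of CML.

```shell
#!/usr/bin/env bash
# Sketch: flag instances that have been running longer than a limit,
# without relying on the cml-* Name tag. Helper names are hypothetical.

# is_overdue LAUNCH_TIME MAX_AGE_SECONDS [NOW_EPOCH]
# Exit 0 when LAUNCH_TIME is more than MAX_AGE_SECONDS in the past.
# Assumes GNU date, which parses ISO 8601 timestamps via -d.
is_overdue() {
  local launch_epoch now_epoch
  launch_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch=${3:-$(date -u +%s)}
  [ $(( now_epoch - launch_epoch )) -gt "$2" ]
}

# list_overdue MAX_AGE_SECONDS [NOW_EPOCH]
# Reads "INSTANCE_ID LAUNCH_TIME" lines on stdin and prints the IDs
# of the instances that are overdue.
list_overdue() {
  local id launched
  while read -r id launched; do
    if is_overdue "$launched" "$1" "${2:-}"; then
      echo "$id"
    fi
  done
}

# Possible usage, assuming the AWS CLI is installed and configured:
#   aws ec2 describe-instances \
#     --filters "Name=instance-state-name,Values=running" \
#     --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' \
#     --output text | list_overdue 3600
```

Anything this prints still has to be checked by hand before terminating it, since the filter only looks at age and would also match long-lived instances that have nothing to do with CML.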
Here’s the yml we are using:
name: train and evaluate rasa model
on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:
jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
            --cloud aws \
            --cloud-region eu-west \
            --cloud-type=c5a.4xlarge \
            --cloud-spot true \
            --labels=cml-runner,voice-control,oms-rasa-2 \
            --idle-timeout 60 \
            --reuse
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: docker://dvcorg/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8.5'
      - name: Install dependencies
        run: |
          apt-get update -y
          apt-get install make python3-pip virtualenv curl
      - name: cml
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: eu-west-1
        run: |
          python --version
          make virtualenv
          dvc repro
          echo "## Metrics" > report.md
          git fetch --prune
          dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
          dvc metrics diff main --show-md | grep -E "(intent|entity|action).*weighted" | sort >> report.md
          sed "s/results\///g" -i report.md
          cml-send-comment report.md
          dvc push
      - uses: actions/upload-artifact@v2
        with:
          name: gh-artifact-${{ github.event.pull_request.head.sha }}
          path: |
            report.md
            results
          retention-days: 30
      - uses: EndBug/add-and-commit@v7
        if: ${{ github.ref != 'refs/heads/main' }} && ${{ github.ref != 'refs/heads/rasax/prod' }}
        with:
          add: 'dvc.lock --force'
          pull_strategy: 'NO-PULL'
          message: 'chg: dvc repro'
Issue Analytics
- State:
- Created 2 years ago
- Comments: 24 (12 by maintainers)
Top GitHub Comments
Amazing, great work @DavidGOrtega 👏
I have observed this too, with instances started with --single staying up for days. As a workaround I am now using
echo "/sbin/poweroff" | /usr/bin/at now + 60 min
on startup to schedule a shutdown. (I have also had the no-name issue happen once.)
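
The workaround above could be wrapped in a small startup script so the time limit is configurable. This is a rough sketch only, assuming at(1) is installed at /usr/bin/at and its daemon is running on the instance; at_timespec and schedule_poweroff are hypothetical names, not CML features.

```shell
#!/usr/bin/env bash
# Sketch of the "scheduled poweroff" safety net from the comment above.

# at_timespec MINUTES
# Prints a time specification for at(1), e.g. "now + 60 min".
# Rejects anything that is not a positive integer.
at_timespec() {
  case "$1" in
    ''|*[!0-9]*)
      echo "minutes must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$1" -le 0 ]; then
    echo "minutes must be a positive integer" >&2
    return 1
  fi
  echo "now + $1 min"
}

# schedule_poweroff [MINUTES]
# Queues /sbin/poweroff to run MINUTES from now (default 60).
schedule_poweroff() {
  local spec
  spec=$(at_timespec "${1:-60}") || return 1
  # $spec is left unquoted on purpose: at(1) expects "now", "+",
  # the number and the unit as separate arguments.
  echo "/sbin/poweroff" | /usr/bin/at $spec
}

# Example, run once when the instance boots:
#   schedule_poweroff 60
```

Note that this is a hard cutoff: a job that is still running when the timer fires is killed along with the instance, so the limit needs to be longer than the longest expected job.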