CML seemingly fails to restart job after AWS Spot instances have been shut down
See original GitHub issue.

Hey everyone,

So I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS spot runners have been told to shut down. However, this doesn't seem to be happening for me.
A couple of details about our case:
- Our cloud is AWS
- We're (as far as I can tell) using the `latest` version of CML to deploy a bunch of runners as shown below.
```yaml
deploy_runners:
  name: Deploy Cloud Instances
  needs: [setup_config]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - uses: iterative/setup-cml@v1
      with:
        version: latest
    - name: "Deploy runner on EC2"
      shell: bash
      env:
        repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
        CASE_NAME: ${{ matrix.case_name }}
        N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
      run: |
        for (( i=1; i<=N_RUNNERS; i++ ))
        do
          echo "Deploying runner ${i}"
          cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
        done
        wait
        echo "Deployed ${N_RUNNERS} runners."
```
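For reference, the `run:` step above relies on the standard shell background-and-wait pattern, so all runners are deployed concurrently and the step only finishes once every deploy has returned. A minimal standalone illustration, with `sleep` standing in for the long-running `cml-runner` call:

```shell
#!/usr/bin/env bash
# Minimal illustration of the background-and-wait pattern from the workflow's
# run: step. `sleep` is a placeholder for the cml-runner invocation.
deploy_all() {
  local n_runners="$1"
  for (( i=1; i<=n_runners; i++ )); do
    echo "Deploying runner ${i}"
    sleep 0.1 &   # placeholder for: cml-runner --cloud aws ... &
  done
  wait            # block until every backgrounded deploy returns
  echo "Deployed ${n_runners} runners."
}

deploy_all 3
```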
- The job each runner runs does not use the CML images provided by Iterative
- The job that each runner runs has `continue-on-error` set to `false` (wondering whether that is interfering with CML?)
```yaml
run_optimisation:
  continue-on-error: false
  strategy:
    matrix: ${{ fromJson(needs.setup_config.outputs.json_string).matrix }}
    fail-fast: true
  runs-on: [self-hosted, "cml-runner"]
  container:
    image: python:3.8.10-slim
    volumes:
      - /dev/shm:/dev/shm
```
Issue Analytics
- Created: 2 years ago
- Comments: 15 (9 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Awesome, thank you!
I ran another test and now I'm a little confused: I followed your instructions again and got a similar log (log.txt).
On the EC2 console, I can see that the instance in question has indeed been terminated.
HOWEVER, the spot request that created it still has a "fulfilled" status AND the GitHub Actions job is still running…
I hope this makes more sense to you than it does to me!
Update: After about 5 minutes, the spot request was marked as `terminated-by-user`, but the GitHub Actions job is still running… As far as I can tell, no new spot requests have been made.
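For anyone else debugging the same symptom, it can help to inspect the spot request states directly with the AWS CLI rather than the console (a sketch, assuming the CLI is configured with the same credentials as the workflow; the region and the helper function are illustrative, not part of CML):

```shell
#!/usr/bin/env bash
# Helper (illustrative): only "open" or "active" spot requests can still back a
# runner; "closed", "cancelled", or "failed" requests will not be replaced by AWS.
spot_request_live() {
  case "$1" in
    open|active) return 0 ;;
    *)           return 1 ;;
  esac
}

# List spot requests and their states in the region the runners were deployed to.
if command -v aws >/dev/null; then
  aws ec2 describe-spot-instance-requests \
    --region eu-west-2 \
    --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}' \
    --output table
fi
```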