cml-runner fails to deploy runners on EC2
Hey everyone,
A random issue started appearing yesterday: cml-runner now fails to deploy runners. It seems to coincide with the release of version 0.7.0, but switching back to 0.6.3 does not seem to solve the problem! Updating to 0.7.1 didn’t fix it either!
The command I am running is:
name: Run-Engine-Tests
- name: "Deploy runner on EC2"
  shell: bash
  env:
    repo_token: ${{ secrets.ACCESS_TOKEN_CML_TESTING }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
    CASE_NAME: ${{ matrix.case_name }}
    N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
    NEPTUNE_CUSTOM_RUN_ID: ${{ needs.setup_neptune_custom_run_id.outputs.neptune_custom_run_id }}
  run: |
    for (( i=1; i<=N_RUNNERS; i++ ))
    do
      echo "Deploying runner ${i}"
      cml-runner \
        --cloud aws \
        --cloud-region eu-west-2 \
        --cloud-type=m \
        --cloud-hdd-size 100 \
        --cloud-spot \
        --labels=cml-runner-${NEPTUNE_CUSTOM_RUN_ID} || exit 1 &
    done
    wait
    echo "Deployed ${N_RUNNERS} runners."
- run: >-
    cat "$TF_LOG_PATH"
I’ve trimmed it so that only the relevant part is shown. I’m also attaching the Terraform logs; hopefully they help!
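A side note on the loop above, separate from the version regression: in cml-runner ... || exit 1 &, the trailing & backgrounds the whole "cml-runner || exit 1" list, so exit 1 only ends that background subshell, and a bare wait returns 0 once all background jobs have finished. A failed deployment therefore does not fail the step. The following is a minimal sketch, assuming bash, that waits on each PID individually and propagates failures; deploy_runner and deploy_all are hypothetical helper names, and the real cml-runner flags would go inside deploy_runner.

```shell
#!/usr/bin/env bash
# Sketch: deploy N runners in parallel and fail if any single deployment fails.

# Hypothetical stand-in for one deployment. In the workflow, its body would be
# the cml-runner call with the --cloud* and --labels flags shown above.
deploy_runner() {
  echo "Deploying runner $1"
}

# Launch every deployment in the background, remember each PID, then wait on
# them one by one. "wait PID" returns that job's exit status, whereas a bare
# "wait" (as in the original loop) returns 0 once all jobs have finished.
deploy_all() {
  local n="$1" i pid failed=0
  local -a pids=()
  for (( i=1; i<=n; i++ )); do
    deploy_runner "$i" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=1
  done
  return "$failed"
}

# Usage: N_RUNNERS comes from the step's env block; default to 1 for a local try.
if deploy_all "${N_RUNNERS:-1}"; then
  echo "Deployed ${N_RUNNERS:-1} runners."
else
  echo "At least one runner failed to deploy." >&2
  exit 1
fi
```

Waiting per PID keeps the deployments parallel while still letting the step's exit status reflect every individual cml-runner result.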
Looking at the EC2 console on the AWS side, I can see that the EC2 instances spin up properly but then get shut down after about 30 seconds. On the Spot Requests tab, the status is displayed as terminated-by-user, so it’s not AWS shutting them down.
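To check from the command line which principal issued the terminations, a sketch using the AWS CLI could look like the block below. It assumes AWS CLI v2 configured with credentials that can read EC2 and CloudTrail, and that the terminations are still within CloudTrail's 90-day event history; the function names spot_request_status and who_terminated are hypothetical. If the Username column shows the IAM user whose keys the workflow uses, the instances were most likely terminated from the workflow side (cml-runner or the Terraform run it drives) rather than by AWS.

```shell
#!/usr/bin/env bash
# Sketch: inspect spot requests and find who called TerminateInstances.
# Assumes AWS CLI v2 with credentials that can read EC2 and CloudTrail.
# REGION mirrors the --cloud-region flag used in the workflow.
REGION="${REGION:-eu-west-2}"

# List spot requests with their state and status code (the console's Status column).
spot_request_status() {
  aws ec2 describe-spot-instance-requests \
    --region "$REGION" \
    --query 'SpotInstanceRequests[].[SpotInstanceRequestId,InstanceId,State,Status.Code]' \
    --output table
}

# Look up recent TerminateInstances calls in CloudTrail event history and show who made them.
who_terminated() {
  aws cloudtrail lookup-events \
    --region "$REGION" \
    --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances \
    --max-items 20 \
    --query 'Events[].[EventTime,Username]' \
    --output table
}
```

Usage: source the file, then run spot_request_status followed by who_terminated shortly after a failed deployment.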
Finally, I also noticed that the runners on EC2 are now named Hosted Agent, which wasn’t the case before: they used to be named something like iterative-<random_string>. Not sure if it’s relevant, but I’m putting it out there just in case!
Attached logs: 1_Set up job.txt, 2_Run actionscheckout@v2.txt, 3_Run iterativesetup-cml@v1.txt, 4_Deploy runner on EC2.txt, 5_Run cat $TF_LOG_PATH.txt, 10_Post Run actionscheckout@v2.txt, 11_Complete job.txt
Created 2 years ago; 6 comments (4 by maintainers).
Top GitHub Comments
Yup, I can confirm this fixes it! Do you mind explaining why? 😅 I’m just curious!
@thatGreekGuy96, please run the following to confirm the issue: