`cml runner` early shutdown from idle-timeout with active job
See original GitHub issueCloud provider: gcp
SCM / CICD: github
/ actions
I have been playing with short --idle-timeout
values and I was a bit baffled when I encountered this. One workflow shutdown earlier before the GitHub action completes the final job, and the other completes as expected, but they are using nearly identical workflows.
Successful workflow:
name: Build and run Inference - DEV
on:
push:
branches: [dev]
jobs:
build:
environment: dev
runs-on: ubuntu-latest
steps:
- uses: dacbd/gcr-build-push@main
with:
tags: latest
project-id: ${{ secrets.GCP_PROJECT_ID }}
GCR-key: ${{ secrets.GCP_GCR_KEY }}
container-name: ***
deploy-runner:
runs-on: ubuntu-latest
environment: dev
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: Deploy runner on GCP
env:
GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
run: |
cml-runner \
--single \
--idle-timeout=360 \
--token=${{ secrets.PAT_DCB }} \
--cloud=gcp \
--cloud-region=us-west \
--cloud-type=e2-standard-16
run-inference:
needs: [build, deploy-runner]
runs-on: [self-hosted, cml]
environment: dev
steps:
- uses: actions/checkout@v2
- name: Setup Container image name
run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
- name: Docker GCR Login
uses: docker/login-action@v1
with:
registry: gcr.io
username: _json_key
password: ${{ secrets.GCP_GCR_KEY }}
- name: Load container from GCR
run: docker pull ${{ env.ML_CONTAINER }}:latest
- name: Run Inference
run: |
docker run ****
Failed workflow:
name: Build and run Inference - PROD
on:
schedule: # Saturday at 1400 UTC
- cron: '0 14 * * 6'
push:
branches: [prod]
jobs:
build:
environment: prod
runs-on: ubuntu-latest
steps:
- uses: dacbd/gcr-build-push@v1
with:
tags: latest
project-id: ${{ secrets.GCP_PROJECT_ID }}
GCR-key: ${{ secrets.GCP_GCR_KEY }}
container-name: ***
deploy-runner:
runs-on: ubuntu-latest
environment: prod
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: Deploy runner on GCP
env:
GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
run: |
cml-runner \
--single \
--idle-timeout=360 \
--token=${{ secrets.PAT_DCB }} \
--cloud=gcp \
--cloud-region=us-west \
--cloud-type=e2-standard-16
run-inference:
needs: [build, deploy-runner]
runs-on: [self-hosted, cml]
environment: prod
steps:
- uses: actions/checkout@v2
- name: Setup Container image name
run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
- name: Docker GCR Login
uses: docker/login-action@v1
with:
registry: gcr.io
username: _json_key
password: ${{ secrets.GCP_GCR_KEY }}
- name: Load container from GCR
run: docker pull ${{ env.ML_CONTAINER }}
- name: Run Inference
run: |
docker run ***
Successful cml.service
log
daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.
Failed cml.service
log
daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:44 (34 by maintainers)
Top Results From Across the Web
runner | CML
--no-retry : Don't restart the workflow when terminated due to instance disposal or GitHub Actions timeout. --single : Terminate runner after one workflow...
Read more >Bug ? idle runners are still active after the IdleTimeout time
I have setup autoscaling with openstack on ovh with different idle time depending to the period and one max of 1800 seconds. But...
Read more >Re: Catalyst switches: Endless access-session for failed MAC ...
Hi community,. I'm using Catalyst switches and want to perform open mode (NAC monitor) mode using IBNS 2.0 configuration. The RADIUS server is...
Read more >Interpretation of Idle Timeout - Ansys Learning Forum
My question is basically whether some setting can be adjusted so that the simulation keeps running for a longer time, even after the...
Read more >IBM Power Systems HMC Implementation and Usage Guide
08052014data.xml 08052014data_dir.xml tmp hscroot@hmc8:~>. Note: The mkprofdata command also can be run in Power Off condition, regardless of the.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
This particular instance has is a pretty healthy size and has ran several times without incident.
I have seen several Self-Hosted GitHub Runners crash due to being starved of resources and I haven’t seen them behave like this. (Show the job as Cancelled). Even though GitHub says in it’s message that it could be from resources. (At least in our use they seem die in much more dramatic fashion 😂)
I haven’t been able to get this error again, but in my past logs I have not had that line, and the PAT has had the same permissions.
however, in both cases, they were prefaced with a log parsing failure.
for the one that I was able to capture.