`cml runner` early shutdown from idle-timeout with active job

Cloud provider: gcp
SCM / CICD: github / actions

I have been playing with short --idle-timeout values and was baffled when I encountered this. In one workflow, the runner shut down before the GitHub Actions run completed its final job; the other completed as expected, even though the two workflows are nearly identical.

Successful workflow:

name: Build and run Inference - DEV
on:
  push:
    branches: [dev]
jobs:
  build:
    environment: dev
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@main
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: dev
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}:latest
      - name: Run Inference
        run: |
          docker run ****

Failed workflow:

name: Build and run Inference - PROD
on:
  schedule: # Saturday at 1400 UTC
    - cron: '0 14 * * 6'
  push:
    branches: [prod]
jobs:
  build:
    environment: prod
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@v1
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: prod
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}
      - name: Run Inference
        run: |
          docker run ***

Successful cml.service log

daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.

Failed cml.service log

daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.
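
The timestamps in the failed log appear consistent with the idle timer starting when the runner reported "Listening for Jobs" and never being reset by the job start. This is an inference from the log, not a confirmed diagnosis. A rough check of the arithmetic, as a sketch: the helper `secondsBetween` is hypothetical, and the timestamps are copied from the log above (the journald timestamps only have second precision).

```typescript
// Hypothetical helper (not part of CML): elapsed seconds between two ISO timestamps.
function secondsBetween(startIso: string, endIso: string): number {
  return (Date.parse(endIso) - Date.parse(startIso)) / 1000;
}

// "runner status Listening for Jobs" (status: ready) in the failed log.
const readyAt = "2021-11-09T19:26:50.583Z";
// journald timestamp of the "Failed parsing log" warning.
const parseFailureAt = "2021-11-09T19:27:02Z";
// journald timestamp of the "timeout:360" termination line.
const terminatedAt = "2021-11-09T19:32:51Z";

const idleWindow = secondsBetween(readyAt, terminatedAt);
const sinceParseFailure = secondsBetween(parseFailureAt, terminatedAt);

console.log(idleWindow.toFixed(1));        // 360.4
console.log(sinceParseFailure.toFixed(1)); // 349.0
```

The gap from "ready" to termination is almost exactly the configured --idle-timeout=360, while the gap from the job start is shorter, which suggests the timer was not restarted when the job began.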

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 44 (34 by maintainers)

Top GitHub Comments

dacbd commented, Mar 2, 2022 (4 reactions)

This particular instance is a pretty healthy size and has run several times without incident.

I have seen several self-hosted GitHub runners crash from being starved of resources, and I haven’t seen them behave like this (showing the job as Cancelled), even though GitHub says in its message that it could be caused by resources. (At least in our use, they seem to die in a much more dramatic fashion 😂)

dacbd commented, Feb 15, 2022 (3 reactions)

I haven’t been able to reproduce this error, but my past logs do not have that line, and the PAT has had the same permissions.

However, in both cases, the early shutdown was preceded by a log parsing failure:

Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}

This line is from the one run I was able to capture.
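
For context, the warning text matches the message that JavaScript engines produce when `Array.prototype.reduce` is called on an empty array without an initial value. The sketch below only reproduces that runtime error; `lastField` and `lastFieldSafe` are hypothetical stand-ins and are not CML's actual parser.

```typescript
// Hypothetical stand-in for a log-parsing step; this is not CML's code.
function lastField(fields: string[]): string {
  // No initial value: reduce throws a TypeError when `fields` is empty.
  return fields.reduce((_prev, current) => current);
}

function lastFieldSafe(fields: string[]): string {
  // Supplying an initial value makes the empty case well defined.
  return fields.reduce((_prev, current) => current, "");
}

let caught = "";
try {
  lastField([]);
} catch (err) {
  caught = err instanceof Error ? err.message : String(err);
}

// On Node.js (V8) this is "Reduce of empty array with no initial value".
console.log(caught);
// The guarded variant returns the initial value instead of throwing.
console.log(JSON.stringify(lastFieldSafe([])));
console.log(lastFieldSafe(["2021-11-09 19:27:01Z", "Running job: run-inference"]));
```

If a parser throws like this on the line that announces a job start, the job-start event may never be recorded, which would fit the behaviour seen in the failed log. That link is an assumption drawn from the logs in this issue.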
