`cml runner` aggressively shutting down instance with active job running


Similar to #808, our GCP VM instance has been shutting down at random within the first few minutes, even though a job is still running. We noticed that the GitHub pull, etc. starts on the runner, so I don't think it's an authentication problem on the runner side. The log from the VM is below. We have been trying to get CML working on an n2-standard-4 (slowly beefing up the server), which has 4 vCPUs, 16 GB of RAM, a 50 GB HDD, and a 10 Gbps network. Is there any way to debug what might be causing the issue on the VM (e.g. a flag that stops the aggressive instance destruction so that we can investigate)?

Workflow File:

name: CML

env:
  GCP_PROJECT_ID: ***
  GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CREDENTIALS }}

on:
  pull_request:
    branches:
      - main

jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: navikt/github-app-token-generator@v1
        id: get_token
        with:
          private-key: ${{ secrets.CML_GITHUB_APP_PEM }}
          app-id: ${{ secrets.CML_GITHUB_APP_ID }}
      - run: echo "REPO_TOKEN=$REPO_TOKEN" >> "$GITHUB_ENV"
        env:
          REPO_TOKEN: ${{ steps.get_token.outputs.token }}
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner
        run: |
          cml runner \
            --cloud gcp \
            --cloud-region us-west1-a \
            --cloud-type n2-standard-8 \
            --cloud-gpu nogpu \
            --cloud-hdd-size 50 \
            --single \
            --idle-timeout 100000

  train-model:
    runs-on: [self-hosted, cml]
    needs: deploy-runner
    timeout-minutes: 4320 # 72h
    container:
      image: docker://iterativeai/cml:0-dvc2-base1
    env:
      REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - name: Train model
        run: |
          python test_cml.py

          # Create CML report
          cat test_cml.txt >> report.md
          cml send-comment report.md

Output Logs:

-- Logs begin at Thu 2022-06-09 05:59:19 UTC. --
Jun 09 06:01:32 cml-xav4fkawcs systemd[1]: Started cml.service.
Jun 09 06:01:38 cml-xav4fkawcs cml.sh[21654]: {"level":"info","message":"Preparing workdir /tmp/tmp.OIaVby26sE/.cml/cml-xav4fkawcs..."}
Jun 09 06:01:38 cml-xav4fkawcs cml.sh[21654]: {"level":"info","message":"Launching github runner"}
Jun 09 06:01:55 cml-xav4fkawcs cml.sh[21654]: {"level":"warn","message":"SpotNotifier can not be started."}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.446Z","level":"info","message":"runner status","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.447Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.856Z","level":"info","message":"runner status Current runner version: '2.292.0'","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.857Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/spark-64/***","status":"ready"}
Jun 09 06:02:07 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:02:07.077Z","job":"gh","level":"info","message":"runner status Running job: train-model","repo":"https://github.com/spark-64/***","status":"job_started"}
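When digging through a log like the one above, it can help to strip the journald prefix and look only at the JSON payloads that `cml.sh` emits. The following is a minimal sketch, not part of CML: the function name `parse_cml_log` and the regular expression are illustrative assumptions based purely on the line format shown in this issue.

```python
import json
import re

# Assumed line shape, taken from the log above:
#   "Jun 09 06:01:38 <host> <unit>[<pid>]: {json payload}"
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+) (\S+?)\[(\d+)\]: (\{.*\})$"
)


def parse_cml_log(text):
    """Return one dict per log line that carries a JSON payload."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. the plain-text systemd "Started cml.service." line
        stamp, _host, _unit, _pid, payload = match.groups()
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        event["journal_time"] = stamp  # keep the journald timestamp alongside
        events.append(event)
    return events


# Three lines copied from the log in this issue.
SAMPLE = """\
Jun 09 06:01:32 cml-xav4fkawcs systemd[1]: Started cml.service.
Jun 09 06:01:55 cml-xav4fkawcs cml.sh[21654]: {"level":"warn","message":"SpotNotifier can not be started."}
Jun 09 06:02:07 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:02:07.077Z","job":"gh","level":"info","message":"runner status Running job: train-model","repo":"https://github.com/spark-64/***","status":"job_started"}
"""

if __name__ == "__main__":
    for event in parse_cml_log(SAMPLE):
        print(
            event["journal_time"],
            event["level"],
            event.get("status", "-"),
            event["message"],
        )
```

Filtering on the `status` field this way makes it easier to see the last state the runner reported (here `job_started`) before the instance disappeared.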

GitHub Actions Hanging:

[screenshot not preserved]

My colleague @TessaPhillips and I have been trying to figure this out to no avail. We're more than happy to contribute if you can point us in the right direction!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

4 reactions
dacbd commented, Jun 11, 2022

Yeah probably a misnomer - I’m curious where the 72hr timeout came from? The usage limits show 35 days for a maximum workflow runtime - did they maybe change this timeout or does it come from somewhere else?

It does appear that they have made changes for self-hosted as well as their provided runners 🙃
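To put the durations mentioned in this thread on a common scale, here is a small sketch. It assumes that `--idle-timeout` is given in seconds (an assumption here, not something stated in the issue) and that `timeout-minutes` is in minutes. The variable names are illustrative only.

```python
from datetime import timedelta

# Values taken from the workflow file and the comment above.
job_timeout = timedelta(minutes=4320)      # timeout-minutes: 4320
idle_timeout = timedelta(seconds=100000)   # --idle-timeout 100000 (assumed to be seconds)
max_workflow = timedelta(days=35)          # maximum workflow runtime quoted in the comment


def hours(delta):
    """Convert a timedelta to a number of hours."""
    return delta.total_seconds() / 3600


if __name__ == "__main__":
    print(f"job timeout:   {hours(job_timeout):.1f} h")
    print(f"idle timeout:  {hours(idle_timeout):.1f} h")
    print(f"max workflow:  {hours(max_workflow):.1f} h")
```

Under those assumptions, the idle timeout in the workflow (roughly 27.8 hours) is shorter than the 72-hour job timeout, and both are far below the 35-day figure.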

1 reaction
dacbd commented, Jun 13, 2022

@danieljimeneznz feel free to join our Discord if you have more questions not directly related to an issue
