`cml runner` aggressively shutting down instance with active job running
Similar to #808, we have been seeing our GCP VM instance shut down randomly within the first few minutes, even though a job is still running. We noticed that the GitHub pull, etc. starts on the runner, so I don't think it's an authentication problem on the runner; the log from the VM is below. We have been trying to get CML working on an n2-standard-4 (slowly beefing up the server), which has 4 vCPUs, 16 GB RAM, a 50 GB HDD, and a 10 Gbps network. Is there any way to debug what might be causing the issues on the VM (e.g. a flag that stops the aggressive instance destruction so that we can debug this)?
Workflow File:
name: CML
env:
  GCP_PROJECT_ID: ***
  GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CREDENTIALS }}
on:
  pull_request:
    branches:
      - main
jobs:
  deploy-runner:
    runs-on: ubuntu-latest
    steps:
      - uses: navikt/github-app-token-generator@v1
        id: get_token
        with:
          private-key: ${{ secrets.CML_GITHUB_APP_PEM }}
          app-id: ${{ secrets.CML_GITHUB_APP_ID }}
      - run: echo "REPO_TOKEN=$REPO_TOKEN" >> "$GITHUB_ENV"
        env:
          REPO_TOKEN: ${{ steps.get_token.outputs.token }}
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner
        run: |
          cml runner \
            --cloud gcp \
            --cloud-region us-west1-a \
            --cloud-type n2-standard-8 \
            --cloud-gpu nogpu \
            --cloud-hdd-size 50 \
            --single \
            --idle-timeout 100000
  train-model:
    runs-on: [self-hosted, cml]
    needs: deploy-runner
    timeout-minutes: 4320 # 72h
    container:
      image: docker://iterativeai/cml:0-dvc2-base1
    env:
      REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v2
      - name: Train model
        run: |
          python test_cml.py
          # Create CML report
          cat test_cml.txt >> report.md
          cml send-comment report.md
Output Logs:
-- Logs begin at Thu 2022-06-09 05:59:19 UTC. --
Jun 09 06:01:32 cml-xav4fkawcs systemd[1]: Started cml.service.
Jun 09 06:01:38 cml-xav4fkawcs cml.sh[21654]: {"level":"info","message":"Preparing workdir /tmp/tmp.OIaVby26sE/.cml/cml-xav4fkawcs..."}
Jun 09 06:01:38 cml-xav4fkawcs cml.sh[21654]: {"level":"info","message":"Launching github runner"}
Jun 09 06:01:55 cml-xav4fkawcs cml.sh[21654]: {"level":"warn","message":"SpotNotifier can not be started."}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.446Z","level":"info","message":"runner status","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.447Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.856Z","level":"info","message":"runner status Current runner version: '2.292.0'","repo":"https://github.com/spark-64/***"}
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.857Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/spark-64/***","status":"ready"}
Jun 09 06:02:07 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:02:07.077Z","job":"gh","level":"info","message":"runner status Running job: train-model","repo":"https://github.com/spark-64/***","status":"job_started"}
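When comparing several of these runs, it can help to pull the structured JSON payloads out of the journald lines and look at the timing between status changes. The following is only a rough sketch in Python: it assumes every `cml.sh` line carries a single JSON object after the journald prefix, and that the `status` and `date` fields look like the ones in the excerpt above. The helper names `parse_events` and `status_times` are made up for illustration and are not part of CML.

```python
import json
from datetime import datetime

# Two lines copied from the log excerpt above (journald prefix + JSON payload).
SAMPLE = """\
Jun 09 06:01:56 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:01:56.857Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/spark-64/***","status":"ready"}
Jun 09 06:02:07 cml-xav4fkawcs cml.sh[21654]: {"date":"2022-06-09T06:02:07.077Z","job":"gh","level":"info","message":"runner status Running job: train-model","repo":"https://github.com/spark-64/***","status":"job_started"}
"""


def parse_events(text):
    """Return the JSON payloads found in journald-style lines.

    Lines without a JSON object (or with a malformed one) are skipped.
    """
    events = []
    for line in text.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # e.g. "Started cml.service." has no JSON payload
        try:
            events.append(json.loads(line[start:]))
        except json.JSONDecodeError:
            continue
    return events


def status_times(events):
    """Map each 'status' value to the timestamp of its first occurrence."""
    times = {}
    for event in events:
        status, date = event.get("status"), event.get("date")
        if status and date and status not in times:
            times[status] = datetime.strptime(date, "%Y-%m-%dT%H:%M:%S.%fZ")
    return times


events = parse_events(SAMPLE)
times = status_times(events)
delay = (times["job_started"] - times["ready"]).total_seconds()
print(f"job picked up {delay:.2f}s after the runner became ready")
```

For the two sample lines this reports a gap of 10.22 seconds between `ready` and `job_started`. Feeding it the full journal for the unit would show whether any further status events are logged before the instance goes away.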
GitHub Actions Hanging:
My colleague @TessaPhillips and I have been trying to figure this out to no avail. We are more than happy to contribute if you can point us in the right direction!
Top GitHub Comments
It does appear that they have made changes for self-hosted as well as their provided runners 🙃
@danieljimeneznz feel free to join our Discord if you have more questions not directly related to an issue