GCP cloud runner not terminating
This is a repeat of #661, which was supposedly fixed in #653. Unfortunately, I'm not seeing any change in the shutdown behavior of my GCP compute instances: they keep running past the timeout interval.
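A quick way to double-check for leftover machines is to list the Compute Engine instances in the project the runner deploys into. This is only a diagnostic sketch: `my-gcp-project` is a placeholder project ID, and the `cml` name filter is an assumption about how the runner names the instances it creates.

```bash
# Placeholder project ID; the "cml" name filter is an assumption about
# how the runner names the instances it creates.
gcloud compute instances list \
  --project my-gcp-project \
  --filter="name~cml" \
  --format="table(name, zone, status, creationTimestamp)"

# Any instance still RUNNING long after the workflow finished can be
# removed by hand (the zone must match the one reported above).
gcloud compute instances delete <instance-name> --zone europe-west1-b
```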
I’m using the same workflow as before (in #661):
```yaml
name: 'Train-in-the-cloud-GCP'
on:
  workflow_dispatch:
jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: 'Deploy runner on GCP'
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          # Notice use of `GOOGLE_APPLICATION_CREDENTIALS_DATA` instead of
          # `GOOGLE_APPLICATION_CREDENTIALS`. Contrary to what docs suggest,
          # the latter causes problems for terraform.
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }}
        run: |
          cml-runner \
            --cloud gcp \
            --cloud-region europe-west1-b \
            --cloud-type=n1-standard-1 \
            --labels=cml-runner

  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: 'Train my dummy model'
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        run: |
          echo "Training a super awesome model"
          sleep 5
          echo "Training complete"
```
Anyway, this seems to contradict the tests, as @DavidGOrtega explains in the comments under #653:
> […] tests with TPI indicates that the instances are disposed after the expected time.
Any idea what I might be doing wrong?
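For completeness, one knob worth ruling out: if I read the runner docs correctly, `cml-runner` accepts an `--idle-timeout` option (seconds to wait for new jobs before shutting down). Passing it explicitly, as sketched below with an example value, would at least remove any ambiguity about which timeout is in play; the rest of the invocation mirrors the workflow above.

```bash
cml-runner \
  --cloud gcp \
  --cloud-region europe-west1-b \
  --cloud-type=n1-standard-1 \
  --labels=cml-runner \
  --idle-timeout=300   # example value: shut down after 5 idle minutes
```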
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @dacbd, sorry to keep you waiting. Been a while since I looked at this.
Anyway, I’m happy to confirm that instances are now indeed stopped and deleted as expected! 😃 That’s using the exact same workflow as above. Great to see you’ve made progress with this. Thanks!
@lemontheme I believe this issue is resolved; can you confirm that your workflow is functional without any workarounds?