GCP VM instance not terminating after timeout

Similar issue to https://github.com/iterative/cml/issues/678

I’m starting a self-hosted runner on GCP via GitLab CI/CD:

deploy-runner:
  stage: start runner
  image: iterativeai/cml:0-dvc2-base1
  resource_group: all
  script:
    - cml runner --cloud=gcp --cloud-region=eu-north --cloud-type=c2-standard-4 --labels=cml-runner --reuse --idle-timeout 600

After the idle timeout expires, the VM instance does not shut down.

Running journalctl --unit cml --no-pager on the instance shows:

-- Logs begin at Sat 2021-12-04 18:40:23 UTC, end at Sat 2021-12-04 19:02:12 UTC. --
Dec 04 18:43:29 cml-4ejd6b8lzc systemd[1]: Started cml.service.
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Preparing workdir /tmp/tmp.rtgpwktKf5/.cml/cml-4ejd6b8lzc..."}
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Launching gitlab runner"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"level":"warn","message":"SpotNotifier can not be started."}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.453Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.454Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"ready"}
Dec 04 18:43:42 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:42.276Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_started"}
Dec 04 18:44:38 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:44:38.904Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:46 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:49 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.649Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.952Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Dec 04 18:56:48 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Unregistering runner cml-4ejd6b8lzc..."}
Dec 04 18:56:49 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"\tSuccess"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.
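
To confirm from the log that the idle timeout itself fired on schedule, the gap between the last runner event and the "terminated" event can be computed. The following is a minimal sketch, not part of cml: the function names are hypothetical, the year is supplied by hand because the syslog prefix omits it, and the sample contains only three lines copied from the log above.

```python
import json
import re
from datetime import datetime

# Matches "<syslog timestamp> <host> cml.sh[PID]: <json payload>" lines.
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ cml\.sh\[\d+\]: (\{.*\})$"
)


def parse_cml_events(log_text, year=2021):
    """Return (timestamp, payload) pairs for each cml.sh JSON log line."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip systemd lines and the journal header
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        events.append((stamp, json.loads(match.group(2))))
    return events


def idle_seconds_before_termination(events):
    """Seconds between the 'terminated' event and the event just before it."""
    for index, (stamp, payload) in enumerate(events):
        if payload.get("status") == "terminated" and index > 0:
            return (stamp - events[index - 1][0]).total_seconds()
    return None


SAMPLE = """\
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.
"""

events = parse_cml_events(SAMPLE)
print(idle_seconds_before_termination(events))  # 602.0
```

A gap of roughly 600 seconds matches --idle-timeout 600, which suggests the timeout logic worked and the failure lies in the instance teardown that should follow.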

The runner picks up a job correctly and deregisters itself from the GitLab project. The VM instance just does not shut down.

On Azure, a similar configuration worked and the instances shut down as expected.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

2 reactions
dacbd commented, Dec 14, 2021
2 reactions
dacbd commented, Dec 13, 2021

Hmm, it sounds like some documentation clarification might be required?

Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with to the systemd service unit. Those are the credentials used to create the instance, so they should also be used to tear it down.

The --cloud-permission-set option takes (in GCP’s case) the email of the service account to attach to the instance. The intent is for the application or ML model to use that account to access other cloud provider services, such as S3/object storage.

Are you saying it looks like Terraform tried to use those (the --cloud-permission-set) credentials instead of the original cml runner ones? That is definitely not intended.

This should be easy for me to reproduce, and I’ll try to get it fixed soon. If you are on Discord and willing to test a patch, I can let you know when I have something working (dabarnes on Discord).
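
One way to see whether two different identities are in play is to compare the service account in the credentials cml was invoked with against the one passed to --cloud-permission-set. The sketch below rests on assumptions: that GOOGLE_APPLICATION_CREDENTIALS_DATA holds a service account key as JSON with the usual client_email field, that the permission set is given as a bare email, and that the helper names and both example emails are hypothetical.

```python
import json


def credentials_email(credentials_json):
    """Return the client_email of a service account key, or None if unreadable."""
    try:
        return json.loads(credentials_json).get("client_email")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None


def same_identity(credentials_json, permission_set_email):
    """True when the key and the attached service account are the same account."""
    email = credentials_email(credentials_json)
    return email is not None and email == permission_set_email


# Hypothetical values; in practice the JSON would come from
# os.environ["GOOGLE_APPLICATION_CREDENTIALS_DATA"].
EXAMPLE_KEY = json.dumps(
    {
        "type": "service_account",
        "client_email": "cml-deployer@example-project.iam.gserviceaccount.com",
    }
)
EXAMPLE_PERMISSION_SET = "ml-workload@example-project.iam.gserviceaccount.com"

print(credentials_email(EXAMPLE_KEY))
print(same_identity(EXAMPLE_KEY, EXAMPLE_PERMISSION_SET))
```

If the two accounts differ and only the attached one is visible to the teardown step, a missing permission to delete instances on that account would be consistent with the behaviour described in the issue.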
