GCP VM instance not terminating after timeout
Similar issue to https://github.com/iterative/cml/issues/678
I’m starting a self-hosted runner on GCP via GitLab CI/CD:
deploy-runner:
  stage: start runner
  image: iterativeai/cml:0-dvc2-base1
  resource_group: all
  script:
    - cml runner --cloud=gcp --cloud-region=eu-north --cloud-type=c2-standard-4 --labels=cml-runner --reuse --idle-timeout 600
After the timeout the VM instance is not shutting down.
The command journalctl --unit cml --no-pager shows:
-- Logs begin at Sat 2021-12-04 18:40:23 UTC, end at Sat 2021-12-04 19:02:12 UTC. --
Dec 04 18:43:29 cml-4ejd6b8lzc systemd[1]: Started cml.service.
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Preparing workdir /tmp/tmp.rtgpwktKf5/.cml/cml-4ejd6b8lzc..."}
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Launching gitlab runner"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"level":"warn","message":"SpotNotifier can not be started."}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.453Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.454Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"ready"}
Dec 04 18:43:42 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:42.276Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_started"}
Dec 04 18:44:38 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:44:38.904Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:46 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:49 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.649Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.952Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Dec 04 18:56:48 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Unregistering runner cml-4ejd6b8lzc..."}
Dec 04 18:56:49 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"\tSuccess"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.
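To check that the idle timeout itself fired as configured, the gap between the last job_ended event and the terminated event can be computed from the log. The following is a minimal sketch, assuming the journalctl line layout shown above (syslog-style timestamp, host, cml.sh[pid], then a JSON payload); the function names are hypothetical and the year is passed in because the log prefix omits it.

```python
import json
import re
from datetime import datetime

# Matches lines such as:
#   Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {...json...}
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ cml\.sh\[\d+\]: (\{.*\})$"
)


def parse_events(lines, year=2021):
    """Return (timestamp, payload) pairs for every cml.sh JSON log line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. systemd lines, which carry no JSON payload
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        try:
            payload = json.loads(match.group(2))
        except json.JSONDecodeError:
            continue
        events.append((stamp, payload))
    return events


def idle_seconds_before_termination(events):
    """Seconds between the last job_ended event and the terminated event."""
    last_job_end = None
    for stamp, payload in events:
        status = payload.get("status")
        if status == "job_ended":
            last_job_end = stamp
        elif status == "terminated" and last_job_end is not None:
            return (stamp - last_job_end).total_seconds()
    return None


SAMPLE = [
    'Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}',
    'Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}',
    'Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.',
]

if __name__ == "__main__":
    print(idle_seconds_before_termination(parse_events(SAMPLE)))
```

On the three sample lines taken from the log above this gives 602 seconds, which matches --idle-timeout 600 at the one-second resolution of the log prefix. So the timeout and unregistration work; only the instance teardown afterwards is missing.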
The runner picks up a job correctly and deregisters itself from the GitLab project. The VM instance just does not shut down.
On Azure, a similar config worked fine and the instances were shutting down.
Top GitHub Comments
Can be closed with TPI fix: https://github.com/iterative/terraform-provider-iterative/pull/333
Hmm, it sounds like some documentation clarification might be required?
Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with into the systemd service unit, as those should be the credentials used for the creation of the instance and thus also should be used for the teardown of the instance.
The --cloud-permission-set takes (in GCP’s case) the service account email to attach to the instance; the intent behind that is for the application or ML model to use to access other services from the cloud provider like s3/object storage.
Are you saying it looks like terraform tried to use those (the --cloud-permission-set) creds instead of the original cml runner ones? That is definitely not intended.
This should be easy for me to reproduce and I’ll try to get it fixed soon. If you are on discord and willing to test out a patch, I can let you know when I have something working (dabarnes on discord).
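One way to check which identity is in play is to compare the service account named in the creation credentials with the one passed as the permission set. The sketch below assumes that GOOGLE_APPLICATION_CREDENTIALS_DATA holds the JSON contents of a service account key (such keys carry a client_email field); the function names and the example addresses are hypothetical, and this is not how cml or terraform performs the check internally.

```python
import json
import os


def teardown_identity(env=os.environ):
    """Return the client_email of the credentials cml runner was invoked with.

    Assumption: GOOGLE_APPLICATION_CREDENTIALS_DATA contains a service
    account key as JSON text. Returns None if it is unset or unparsable.
    """
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS_DATA")
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return data.get("client_email")


def same_identity(permission_set_email, env=os.environ):
    """True if the attached service account equals the creation account."""
    creator = teardown_identity(env)
    return creator is not None and creator == permission_set_email


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    fake_env = {
        "GOOGLE_APPLICATION_CREDENTIALS_DATA": json.dumps(
            {"client_email": "creator@example-project.iam.gserviceaccount.com"}
        )
    }
    attached = "workload@example-project.iam.gserviceaccount.com"
    print("creation identity:", teardown_identity(fake_env))
    print("same as permission set:", same_identity(attached, fake_env))
```

If the two differ and the teardown ends up running as the attached account, that account would need permission to delete the instance, which would fit the behaviour described in this thread.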