cml runner fails to provision VM on Azure
I used the following as a simple "hello world" for cml runner on GitLab Community Edition [14.10.1]:
deploy-runner:
  image: iterativeai/cml:latest
  script:
    - |
      cml runner \
        --cloud=azure \
        --cloud-region=eu-west \
        --cloud-type=s \
        --cloud-spot \
        --labels=cml-vm
train-model:
  needs: [deploy-runner]
  tags:
    - cml-vm
  image: ubuntu:latest
  script:
    - echo "hello"
I set up the following variables:
- AZURE_CLIENT_ID
- AZURE_CLIENT_SECRET
- AZURE_SUBSCRIPTION_ID
- AZURE_TENANT_ID
- REPO_TOKEN as well as PERSONAL_ACCESS_TOKEN, because I found the documentation on this confusing (https://cml.dev/doc/self-hosted-runners?tab=GitLab#personal-access-token)
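A variable can be defined in the project settings and still be empty inside the job, so it may be worth checking from within the job itself before calling cml runner. The following is a minimal sketch of such a check; the helper name missing_azure_vars is hypothetical and not part of CML, and the variable names are the four Azure ones listed above.

```python
import os

# The four Azure variable names listed above.
REQUIRED_AZURE_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_TENANT_ID",
)


def missing_azure_vars(env=None):
    """Return the required variable names that are unset or empty in env.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]


# Example with an explicit mapping: one value present, one empty, two absent.
example_env = {"AZURE_CLIENT_ID": "placeholder", "AZURE_TENANT_ID": ""}
print(missing_azure_vars(example_env))
```

Calling missing_azure_vars() with no argument inspects the real job environment, and an empty list means all four values are visible to the process.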
Unfortunately, the deploy-runner job fails with the following output:
$ cml runner \ # collapsed multi-line command
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n + create\n\nTerraform will perform the following actions:\n\n # iterative_cml_runner.runner will be created\n + resource \"iterative_cml_runner\" \"runner\" {\n + cloud = \"azure\"\n + cml_version = \"0.15.2\"\n + docker_volumes = []\n + driver = \"gitlab\"\n + id = (known after apply)\n + idle_timeout = 300\n + instance_hdd_size = 35\n + instance_ip = (known after apply)\n + instance_launch_time = (known after apply)\n + instance_type = \"s\"\n + labels = \"cml-vm\"\n + name = \"cml-elde6fnyv0\"\n + region = \"eu-west\"\n + repo = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n + single = false\n + spot = true\n + spot_price = -1\n + ssh_public = (known after apply)\n + token = (sensitive value)\n }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... 
[2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│ <head>\n│ <title>404 - Not Found</title>\n│ </head>\n│ <body>\n│ <h1>404 - Not Found</h1>\n│ </body>\n│ </html>\n│ Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│ with iterative_cml_runner.runner,\n│ on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│ 8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n","stack":"Error: terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. 
Resource actions are indicated with the following symbols:\n + create\n\nTerraform will perform the following actions:\n\n # iterative_cml_runner.runner will be created\n + resource \"iterative_cml_runner\" \"runner\" {\n + cloud = \"azure\"\n + cml_version = \"0.15.2\"\n + docker_volumes = []\n + driver = \"gitlab\"\n + id = (known after apply)\n + idle_timeout = 300\n + instance_hdd_size = 35\n + instance_ip = (known after apply)\n + instance_launch_time = (known after apply)\n + instance_type = \"s\"\n + labels = \"cml-vm\"\n + name = \"cml-elde6fnyv0\"\n + region = \"eu-west\"\n + repo = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n + single = false\n + spot = true\n + spot_price = -1\n + ssh_public = (known after apply)\n + token = (sensitive value)\n }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... 
[3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│ <head>\n│ <title>404 - Not Found</title>\n│ </head>\n│ <body>\n│ <h1>404 - Not Found</h1>\n│ </body>\n│ </html>\n│ Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│ with iterative_cml_runner.runner,\n│ on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│ 8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n\n at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n at ChildProcess.exithandler (node:child_process:406:5)\n at ChildProcess.emit (node:events:527:28)\n at maybeClose (node:internal/child_process:1092:16)\n at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
{"level":"info","message":"waiting 10 seconds before exiting..."}
My team and I could not work out what the problem is.
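One detail in the log that may help narrow this down: the failing token request goes to 169.254.169.254, which is the Azure Instance Metadata Service address used for managed identities. A request to that address suggests that authentication went through the managed-identity path rather than the service principal credentials, although that reading is an assumption and the log does not confirm the cause. The sketch below pulls the Endpoint values out of error-level JSON log lines; the function names are hypothetical and the sample message is abbreviated from the output above.

```python
import json
import re

IMDS_HOST = "169.254.169.254"  # Azure Instance Metadata Service address


def find_error_endpoints(log_lines):
    """Collect 'Endpoint <url>' values from error-level JSON log lines."""
    endpoints = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. the echoed "$ cml runner \" line
        if not isinstance(record, dict) or record.get("level") != "error":
            continue
        endpoints += re.findall(r"Endpoint (\S+)", record.get("message", ""))
    return endpoints


def used_managed_identity(log_lines):
    """True if any error endpoint points at the instance metadata service."""
    return any(IMDS_HOST in url for url in find_error_endpoints(log_lines))


# Abbreviated sample modelled on the job output above.
sample = [
    "$ cml runner \\ # collapsed multi-line command",
    '{"level":"info","message":"Terraform apply..."}',
    '{"level":"error","message":"adal: Refresh request failed.\\n'
    ' Endpoint http://169.254.169.254/metadata/identity/oauth2/token'
    '?api-version=2018-02-01\\n"}',
]
print(used_managed_identity(sample))
```

Lines that are not valid JSON are skipped on purpose, so the raw job log can be passed in without cleaning it first.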
Additional info:
- I also tried cml with the iterativeai/cml:0-dvc2-base1 Docker image.
- I tried an Azure-specific type and region, without success.
Any help would be very much appreciated.
Issue Analytics
- State:
- Created a year ago
- Comments:17 (16 by maintainers)
Top GitHub Comments
Yes, still not supported on Azure, but please upvote & consider watching the following issues:
Note that --cloud-permission-set is not related to your issue, though: it's just to use managed identities inside your workflows.
@iterative/cml, any objections to wontfix for now?