cml runner fails to provision VM on Azure

I used the following as a simple “hello world” for cml runner on GitLab Community Edition 14.10.1:

deploy-runner:
  image: iterativeai/cml:latest
  script:
    - |
      cml runner \
          --cloud=azure \
          --cloud-region=eu-west \
          --cloud-type=s \
          --cloud-spot \
          --labels=cml-vm

train-model:
  needs: [deploy-runner]
  tags:
    - cml-vm
  image: ubuntu:latest
  script:
    - echo "hello"

I set up

Unfortunately this results in:

$ cml runner \ # collapsed multi-line command
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... 
[1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n","stack":"Error: terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to 
add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n\n    at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at ChildProcess.exithandler (node:child_process:406:5)\n    at ChildProcess.emit (node:events:527:28)\n    at maybeClose (node:internal/child_process:1092:16)\n    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
{"level":"info","message":"waiting 10 seconds before exiting..."}

My team and I could not work out what the problem is.

Additional info:

  • I also tried cml with the iterativeai/cml:0-dvc2-base1 Docker image
  • I tried an Azure-specific instance type and region, without success

Any help would be very much appreciated.
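
One possible reading of the log above: the token request goes to http://169.254.169.254/metadata/identity/oauth2/token, which is the Azure instance metadata endpoint used for managed identities. That endpoint only answers from inside an Azure VM, so a 404 from a GitLab job container would be expected. As far as I can tell, the Azure SDK falls back to managed identity when the service principal credentials are incomplete, so it may be worth confirming that all four variables listed in the CML documentation for Azure (AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_SUBSCRIPTION_ID, AZURE_TENANT_ID) actually reach the job. Below is a minimal sketch of such a pre-flight check; the function name check_azure_credentials is hypothetical and not part of cml.

```shell
#!/bin/sh
# Pre-flight check (hypothetical helper, not part of cml): report which of the
# Azure service principal variables are unset or empty in the job environment.
# The variable names are assumed from the CML documentation for Azure.
check_azure_credentials() {
  missing=""
  for name in AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID; do
    # Indirect lookup of the variable named by $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    # Print names only, never values, so secrets stay out of the job log.
    echo "Missing Azure credentials:$missing" >&2
    return 1
  fi
  echo "All Azure credential variables are set"
}
```

In the deploy-runner job this could run right before cml runner, for example as `check_azure_credentials || exit 1`. Note that protected or masked GitLab CI/CD variables are only exposed to protected branches and tags, which is one way a variable can be defined in the project settings and still be empty in the job.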

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 17 (16 by maintainers)

Top GitHub Comments

3 reactions
0x2b3bfa0 commented, May 27, 2022

Is this new option for GCP and AWS only? no Azure?

Yes, still not supported on Azure, but please upvote & consider watching the following issues:

Note that --cloud-permission-set is not related to your issue, though: it’s just to use managed identities inside your workflows.

2 reactions
0x2b3bfa0 commented, Oct 12, 2022

@iterative/cml, any objections to wontfix for now?
