`cml-runner` times out with ssh handshake failure

Hey everyone, I’m new to CML and I’m trying to use the cml-runner in order to schedule some optimisations.

I can see the EC2 instances get properly deployed, but the cml-runner command gets stuck in the "Terraform apply..." stage. The EC2 instances stay up until the timeout and then shut themselves down. Meanwhile, cml-runner is still stuck waiting for Terraform to finish applying, so I’m forced to cancel the workflow.

Looking into this a little further, I can see that terraform apply itself times out with a handshake failure. If I look at the details of the deployed EC2 instance, I can see that it doesn’t have a public IPv4 address and that it’s in a private subnet (not part of our default VPC). Isn’t this going to prevent the GitHub Actions runner from handshaking with it? Any ideas on what could be causing this behaviour?
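One way to check the observation above is to look at what the EC2 API reports for the instance. The following is a minimal sketch and not part of CML: `summarize_instances` is a hypothetical helper that takes a response shaped like the JSON output of `aws ec2 describe-instances` and reports, per instance, the VPC, the subnet, and whether a public IPv4 address was assigned. The sample response, including all IDs and addresses, is made up for illustration.

```python
import json


def summarize_instances(response):
    """Summarize networking details from a describe-instances style response.

    `response` is assumed to follow the EC2 DescribeInstances layout:
    {"Reservations": [{"Instances": [{...}, ...]}, ...]}.
    """
    summaries = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # The "PublicIpAddress" key is absent when no public address is assigned.
            public_ip = instance.get("PublicIpAddress")
            summaries.append({
                "instance_id": instance.get("InstanceId"),
                "vpc_id": instance.get("VpcId"),
                "subnet_id": instance.get("SubnetId"),
                "public_ip": public_ip,
                "has_public_ip": public_ip is not None,
            })
    return summaries


# Made-up sample data: one instance without a public address, one with.
SAMPLE_RESPONSE = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "VpcId": "vpc-0bbb2222",
                    "SubnetId": "subnet-0ccc3333",
                    "PrivateIpAddress": "10.0.1.25",
                },
                {
                    "InstanceId": "i-0fedcba9876543210",
                    "VpcId": "vpc-0aaa1111",
                    "SubnetId": "subnet-0ddd4444",
                    "PrivateIpAddress": "172.31.5.10",
                    "PublicIpAddress": "203.0.113.7",
                },
            ]
        }
    ]
}

if __name__ == "__main__":
    for summary in summarize_instances(SAMPLE_RESPONSE):
        print(json.dumps(summary))
```

An instance reported with `has_public_ip` set to false cannot be reached directly from outside its VPC, which is consistent with the concern raised above. Whether that is the actual cause of the handshake error is a separate question.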

Here is the error trace:

Preparing workdir /home/runner/.cml/cml-quzgh0u7bx...
Deploying cloud runner plan...
Terraform apply...
{"level":"error","status":"terminated"}
Error: terraform -chdir='/home/runner/.cml/cml-quzgh0u7bx' apply -auto-approve
	iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [10m20s elapsed]
│ Error: Error checking the runner status
│ 
│   on main.tf line 14, in resource "iterative_cml_runner" "runner":
│   14: resource "iterative_cml_runner" "runner" {
│ 
│ 
│ ssh: handshake failed: ssh: unable to authenticate, attempted methods [none
│ publickey], no supported methods remain
╵
    at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:15:27
    at ChildProcess.exithandler (child_process.js:315:5)
    at ChildProcess.emit (events.js:315:20)
    at maybeClose (internal/child_process.js:1048:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:288:5)
iterative_cml_runner.runner: Refreshing state... [id=iterative-2gjlae369da1t]
iterative_cml_runner.runner: Destroying... [id=iterative-2gjlae369da1t]
iterative_cml_runner.runner: Destruction complete after 1s
Destroy complete! Resources: 1 destroyed.
Error: Process completed with exit code 1.

The github token I’m providing to the script has the following permissions attached:

The AWS credentials have the permissions outlined in #429

The GitHub Actions workflow file that this is based on is:

name: Run-Engine-Tests

on:
  push:
    branches:
      - spike/poddie_cicd

jobs:

  deploy_runners:
    name: Deploy EC2 Instances

    # strategy:
    #   matrix:
    #     batch_id: [0]
    #     case_name: ['test']
        # batch_id: [0, 1, 2]
        # case_name: ['ONSHORE_PIPELINE', 'OFFSHORE_PIPELINE']

    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Setup CML
        uses: iterative/setup-cml@v1

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_temp }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_temp }}

        run: |
          cml-runner \
          --cloud aws \
          --cloud-region eu-west-2 \
          --cloud-type=t2.micro \

Thanks a lot in advance!

Top GitHub Comments

0x2b3bfa0 commented, Apr 30, 2021

Glad it worked, @thatGreekGuy96! I’m closing this issue in favor of https://github.com/iterative/terraform-provider-iterative/issues/107.

I’m inclined to think that creating an ad-hoc VPC could be better for the user experience, but that’s open for discussion anyway.

thatGreekGuy96 commented, Apr 30, 2021

@0x2b3bfa0 OK, well, thanks for your help in any case 😄 Turns out we didn’t need that other VPC; it was just a leftover from our old infrastructure, so we’ve gone ahead and deleted it, and now everything works great!

That being said, you might want to change the behaviour here or make it a little clearer in the docs. Creating a VPC with all the right settings, or allowing the user to set the VPC themselves, might do the trick?
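For anyone checking whether a region contains leftover non-default VPCs like the one described in this thread, here is a minimal sketch. It is an illustration only and not something CML provides: `split_vpcs` is a hypothetical helper that takes a response shaped like the JSON output of `aws ec2 describe-vpcs` and separates default from non-default VPC IDs using the `IsDefault` field. The sample VPC IDs and CIDR blocks are made up.

```python
def split_vpcs(response):
    """Split VPC IDs into (default_ids, other_ids).

    `response` is assumed to follow the EC2 DescribeVpcs layout:
    {"Vpcs": [{"VpcId": ..., "IsDefault": ...}, ...]}.
    """
    default_ids = []
    other_ids = []
    for vpc in response.get("Vpcs", []):
        # "IsDefault" is true only for the region's default VPC.
        if vpc.get("IsDefault"):
            default_ids.append(vpc.get("VpcId"))
        else:
            other_ids.append(vpc.get("VpcId"))
    return default_ids, other_ids


# Made-up sample data: one default VPC and one leftover non-default VPC.
SAMPLE_VPCS = {
    "Vpcs": [
        {"VpcId": "vpc-0aaa1111", "IsDefault": True, "CidrBlock": "172.31.0.0/16"},
        {"VpcId": "vpc-0bbb2222", "IsDefault": False, "CidrBlock": "10.0.0.0/16"},
    ]
}

if __name__ == "__main__":
    default_ids, other_ids = split_vpcs(SAMPLE_VPCS)
    print("default VPCs:", default_ids)
    print("non-default VPCs:", other_ids)
```

Any IDs that land in the second list are candidates to review before launching runners in that region.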