`cml-runner` times out with ssh handshake failure
Hey everyone,
I’m new to CML and I’m trying to use the cml-runner in order to schedule some optimisations.
I can see the EC2 instances get properly deployed, but the cml-runner command gets stuck in the "Terraform apply..." stage. The EC2 instances stay up until the timeout and then shut themselves down. Meanwhile, cml-runner is still stuck waiting for Terraform to finish applying, so I’m forced to cancel the workflow.
Looking into this a little further, I can see that terraform apply itself times out with an SSH handshake failure. If I look at the details of the deployed EC2 instance, I can see that it doesn’t have a public IPv4 address and that it’s in a private subnet (not part of our default VPC). Isn’t this going to prevent the GitHub Actions runner from completing the handshake with it? Any ideas on what could be causing this behaviour?
Here is the error trace:
Preparing workdir /home/runner/.cml/cml-quzgh0u7bx...
Deploying cloud runner plan...
Terraform apply...
{"level":"error","status":"terminated"}
Error: terraform -chdir='/home/runner/.cml/cml-quzgh0u7bx' apply -auto-approve
iterative_cml_runner.runner: Creating...
iterative_cml_runner.runner: Still creating... [10s elapsed]
iterative_cml_runner.runner: Still creating... [20s elapsed]
iterative_cml_runner.runner: Still creating... [30s elapsed]
iterative_cml_runner.runner: Still creating... [40s elapsed]
iterative_cml_runner.runner: Still creating... [10m20s elapsed]
│ Error: Error checking the runner status
│
│ on main.tf line 14, in resource "iterative_cml_runner" "runner":
│ 14: resource "iterative_cml_runner" "runner" {
│
│
│ ssh: handshake failed: ssh: unable to authenticate, attempted methods [none
│ publickey], no supported methods remain
╵
at /usr/local/lib/node_modules/@dvcorg/cml/src/utils.js:15:27
at ChildProcess.exithandler (child_process.js:315:5)
at ChildProcess.emit (events.js:315:20)
at maybeClose (internal/child_process.js:1048:16)
at Process.ChildProcess._handle.onexit (internal/child_process.js:288:5)
iterative_cml_runner.runner: Refreshing state... [id=iterative-2gjlae369da1t]
iterative_cml_runner.runner: Destroying... [id=iterative-2gjlae369da1t]
iterative_cml_runner.runner: Destruction complete after 1s
Destroy complete! Resources: 1 destroyed.
Error: Process completed with exit code 1.
The github token I’m providing to the script has the following permissions attached:
The AWS credentials have the permissions outlined in #429
The github action file that this is based on is
name: Run-Engine-Tests
on:
  push:
    branch: spike/poddie_cicd
jobs:
  deploy_runners:
    name: Deploy EC2 Instances
    # strategy:
    #   matrix:
    #     batch_id: [0]
    #     case_name: ['test']
    #     batch_id: [0, 1, 2]
    #     case_name: ['ONSHORE_PIPELINE', 'OFFSHORE_PIPELINE']
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Setup CML
        uses: iterative/setup-cml@v1
      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_temp }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_temp }}
        run: |
          cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=t2.micro \
Thanks a lot in advance!
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:8 (4 by maintainers)
Top GitHub Comments
Glad it worked, @thatGreekGuy96! I’m closing this issue in favor of https://github.com/iterative/terraform-provider-iterative/issues/107.
I’m inclined to think that creating an ad-hoc VPC could be better for the user experience, but that’s open for discussion anyway.
@0x2b3bfa0 OK, well, thanks for your help in any case 😄 Turns out we didn’t need that other VPC; it was just a leftover from our old infrastructure, so we’ve gone ahead and deleted it and now everything works great!
That being said, you might want to change the behaviour here or make it a little clearer in the docs. Creating a VPC with all the right settings, or allowing users to set the VPC themselves, might do the trick?
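For anyone hitting the same symptom, here is a rough sketch of the check described in the original post: does the launched instance have a public IPv4 address, and which subnets would not give it one? It is not part of CML or the Terraform provider. It assumes you already have dictionaries shaped like the EC2 DescribeInstances and DescribeSubnets responses (for example from the AWS CLI or boto3); the helper names and the sample IDs are made up for illustration.

```python
from typing import Any, Dict, List


def has_public_ipv4(instance: Dict[str, Any]) -> bool:
    """Return True if the instance description carries a public IPv4 address.

    EC2 omits the PublicIpAddress field when no public address is assigned.
    """
    return bool(instance.get("PublicIpAddress"))


def subnets_without_public_ip(subnets: List[Dict[str, Any]]) -> List[str]:
    """Return the IDs of subnets that do not auto-assign public IPv4 addresses.

    An instance launched into one of these subnets (without an explicit
    public IP or Elastic IP) is not directly reachable over SSH from a
    hosted GitHub Actions runner.
    """
    return [
        subnet["SubnetId"]
        for subnet in subnets
        if not subnet.get("MapPublicIpOnLaunch", False)
    ]


if __name__ == "__main__":
    # Hypothetical sample data, shaped like EC2 API responses.
    instance = {"InstanceId": "i-example", "SubnetId": "subnet-b"}
    subnets = [
        {"SubnetId": "subnet-a", "MapPublicIpOnLaunch": True},
        {"SubnetId": "subnet-b", "MapPublicIpOnLaunch": False},
    ]
    print("public IPv4:", has_public_ipv4(instance))
    print("private-only subnets:", subnets_without_public_ip(subnets))
```

If the instance lands in one of the listed subnets and has no public address, the SSH status check from the workflow runner has nothing to connect to, which would match the timeout shown in the trace.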