Issues launching AWS clusters

Good Afternoon Team,

Hope all is well!

Wanted to reach out as I am experiencing issues with launching toil on AWS clusters as of late:

toil launch-cluster xxx --leaderNodeType t2.2xlarge -z us-east-2a --keyPairName xxx --leaderStorage 100
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Creating cluster xxx...
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom docker init command of  as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom init command of  as TOIL_CUSTOM_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Selected Flatcar AMI: ami-07e82385de8861b75
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:39:51-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:40:26-0400] [MainThread] [I] [toil.provisioners.node] Attempting to establish SSH connection...
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] ...SSH connection established.
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] Waiting for docker on xxx to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Docker daemon running
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Waiting for toil_leader Toil appliance to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:40:48-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:09-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:30-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:50-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:11-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:31-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:52-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:12-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:33-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:53-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:13-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:34-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:54-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:15-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:35-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:56-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:16-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:37-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:57-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:47:18-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
Traceback (most recent call last):
  File "/Users/jpuerto/toil-test/venv/bin/toil", line 8, in <module>
    sys.exit(main())
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilMain.py", line 31, in main
    get_or_die(module, 'main')()
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilLaunchCluster.py", line 168, in main
    awsEc2ExtraSecurityGroupIds=options.awsEc2ExtraSecurityGroupIds)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 298, in launchCluster
    leaderNode.waitForNode('toil_leader')
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 75, in waitForNode
    self._waitForAppliance(role=role, keyName=keyName)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 171, in _waitForAppliance
    "\nCheck if TOIL_APPLIANCE_SELF is set correctly and the container exists.")
RuntimeError: Appliance failed to start on machine with IP: xxxx
Check if TOIL_APPLIANCE_SELF is set correctly and the container exists.

Any ideas on what might be going on here? I just updated to latest toil this morning. Please let me know if there is any additional information that might help with debugging.

Best regards,

Juan

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-1013

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
adamnovak commented, Oct 4, 2021

We’re going to open some new issue(s) about making our reported versions/release tag names make a bit more sense. It sounds like you managed to get a working setup @jpuerto-psc so I am going to close this out.

0 reactions
jonathanxu18 commented, Sep 28, 2021

@jpuerto-psc The tag has ‘dirty’ in it since you may have had changes in git that weren’t committed yet when you created the docker image. You can take a look here to see when ‘dirty’ gets included in the tag.

The MOTD uses what TOIL_APPLIANCE_SELF is set to when you run make push_docker, so it might be showing the wrong tag if you set it after the image was created. This shows what TOIL_APPLIANCE_SELF defaults to.

Yep, those changes should be merged into master so you could make push_docker from that. The 5.5.0 release also just got pushed this past weekend so you can use that image too.
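
As a rough illustration of the defaulting behavior visible in the log lines of the original report ("Using default docker registry of ... as TOIL_DOCKER_REGISTRY is not set", and so on), here is a minimal Python sketch of how the appliance image name appears to be resolved. This is not Toil's actual code: the function name `resolve_appliance` and the `DEFAULT_*` constants are hypothetical, and the default values are simply copied from the 5.4.0 log output above. Only the three environment variable names come from the log.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the log output in the original report (Toil 5.4.0).
# These constants are illustrative, not part of Toil's API.
DEFAULT_REGISTRY = "quay.io/ucsc_cgl"
DEFAULT_NAME = "toil"
DEFAULT_TAG = "5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6"


def resolve_appliance(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the appliance image, preferring explicit environment overrides.

    Hypothetical helper: TOIL_APPLIANCE_SELF wins if set; otherwise the image
    is assembled as <registry>/<name>:<tag> from the other variables/defaults.
    """
    env = os.environ if env is None else env
    registry = env.get("TOIL_DOCKER_REGISTRY", DEFAULT_REGISTRY)
    name = env.get("TOIL_DOCKER_NAME", DEFAULT_NAME)
    default_image = f"{registry}/{name}:{DEFAULT_TAG}"
    return env.get("TOIL_APPLIANCE_SELF", default_image)
```

Calling `resolve_appliance()` with nothing set in the environment reproduces the default image string from the log. If that image does not exist in the registry, the leader's appliance container can never start, which is consistent with the "Check if TOIL_APPLIANCE_SELF is set correctly and the container exists" hint in the traceback.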
