Issues launching AWS clusters
Good Afternoon Team,
Hope all is well!
Wanted to reach out as I have recently been experiencing issues launching Toil clusters on AWS:
toil launch-cluster xxx --leaderNodeType t2.2xlarge -z us-east-2a --keyPairName xxx --leaderStorage 100
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:41-0400] [MainThread] [I] [toil.utils.toilLaunchCluster] Creating cluster xxx...
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom docker init command of as TOIL_CUSTOM_DOCKER_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default user-defined custom init command of as TOIL_CUSTOM_INIT_COMMAND is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker name of toil as TOIL_DOCKER_NAME is not set.
[2021-09-14T11:39:44-0400] [MainThread] [I] [toil] Using default docker appliance of quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6 as TOIL_APPLIANCE_SELF is not set.
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Selected Flatcar AMI: ami-07e82385de8861b75
[2021-09-14T11:39:45-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:39:51-0400] [MainThread] [I] [toil.lib.ec2] Creating t2.2xlarge instance(s) ...
[2021-09-14T11:40:26-0400] [MainThread] [I] [toil.provisioners.node] Attempting to establish SSH connection...
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] ...SSH connection established.
[2021-09-14T11:40:27-0400] [MainThread] [I] [toil.provisioners.node] Waiting for docker on xxx to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Docker daemon running
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] Waiting for toil_leader Toil appliance to start...
[2021-09-14T11:40:28-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:40:48-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:09-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:30-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:41:50-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:11-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:31-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:42:52-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:12-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:33-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:43:53-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:13-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:34-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:44:54-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:15-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:35-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:45:56-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:16-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:37-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:46:57-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
[2021-09-14T11:47:18-0400] [MainThread] [I] [toil.provisioners.node] ...Still waiting for appliance, trying again in 20 sec...
Traceback (most recent call last):
  File "/Users/jpuerto/toil-test/venv/bin/toil", line 8, in <module>
    sys.exit(main())
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilMain.py", line 31, in main
    get_or_die(module, 'main')()
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/utils/toilLaunchCluster.py", line 168, in main
    awsEc2ExtraSecurityGroupIds=options.awsEc2ExtraSecurityGroupIds)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/aws/awsProvisioner.py", line 298, in launchCluster
    leaderNode.waitForNode('toil_leader')
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 75, in waitForNode
    self._waitForAppliance(role=role, keyName=keyName)
  File "/Users/jpuerto/toil-test/venv/lib/python3.7/site-packages/toil/provisioners/node.py", line 171, in _waitForAppliance
    "\nCheck if TOIL_APPLIANCE_SELF is set correctly and the container exists.")
RuntimeError: Appliance failed to start on machine with IP: xxxx
Check if TOIL_APPLIANCE_SELF is set correctly and the container exists.
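When the appliance fails to start like this, a useful first check is whether the appliance image can be reached from the leader at all. A sketch, using the cluster name and zone from the launch command above and the default appliance tag reported in the log (the cluster name `xxx` is the redacted placeholder from this report):

```shell
# Open a shell on the cluster leader.
toil ssh-cluster -z us-east-2a xxx

# On the leader: check whether the appliance container was created at all,
# and whether it exited with an error.
docker ps -a

# Check that the default appliance image from the log can actually be pulled.
docker pull quay.io/ucsc_cgl/toil:5.4.0-87293d63fa6c76f03bed3adf93414ffee67bf9a7-py3.6
```

If the pull fails, the image referenced by `TOIL_APPLIANCE_SELF` (or its default) does not exist in the registry, which matches the RuntimeError above.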
Any ideas on what might be going on here? I just updated to the latest Toil this morning. Please let me know if there is any additional information that would help with debugging.
Best regards,
Juan
Issue is synchronized with this Jira Task. Issue Number: TOIL-1013
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 6 (3 by maintainers)
Top Results From Across the Web

Troubleshoot why your ECS or EC2 instance can't join ... - AWS
Your Amazon EC2 instance can't register with or join an ECS cluster because of one or more of the following reasons: The ECS...

Troubleshoot a cluster - Amazon EMR - AWS Documentation
The following topics will help you figure out what has gone wrong in your cluster and give you suggestions on how to fix...

Troubleshoot instance launch issues - AWS Documentation
The following issues prevent you from launching an instance. ... If you are launching instances into a cluster placement group, you can get...

AWS ParallelCluster troubleshooting
This section covers how you can troubleshoot node initialization issues. This includes issues where the node fails to launch, power up, or join...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
We’re going to open some new issue(s) about making our reported versions/release tag names make a bit more sense. It sounds like you managed to get a working setup @jpuerto-psc so I am going to close this out.
@jpuerto-psc The tag has 'dirty' in it since you may have had changes in `git` that weren't committed yet when you created the docker image. You can take a look here to see when it determines to include it.

The MOTD uses what `TOIL_APPLIANCE_SELF` is set to when you run `make push_docker`, so it might be showing the wrong tag if you set it after the image was created. This shows what `TOIL_APPLIANCE_SELF` defaults to.

Yep, those changes should be merged into master, so you could `make push_docker` from that. The 5.5.0 release also just got pushed this past weekend, so you can use that image too.