Compute instances fail health check in endless loop
Environment: aws-parallelcluster-2.4.1, centos7, sge; master: c5.9xlarge; compute: c5n.18xlarge
The compute nodes never come online: they repeatedly fail the health check on start-up and are terminated. Here's the output from /var/log/sqswatcher on the master node:
2019-10-25 02:56:16,247 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:18,259 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:56:48,289 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:50,324 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:20,354 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:22,363 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:52,393 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 02:57:52,499 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 02:57:52,564 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:54,574 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:24,604 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:26,613 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:56,643 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:58,703 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 02:58:58,739 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-0012f6c570f00bcd9 not found in the database.
2019-10-25 02:58:58,739 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEBc+SA48ZuhUmx1xVpZQipj8SM8xXztziZmkdk1lQjLwNB+F2rGHWrbG2ZKDtvMG4VsI1ek2PgC9fcw/aY6+Q/Tt+0jEMzYZhrDtwqycJpKYdFJzWjY5/blVSNbuc1ZQTqi7QhxlKkySEZ/igX4uFTGgVoZxGw6SFrDzq9IWjn7yJ54ZyJN8rPIthi57QmkU5inlSwPV5pcj6oAftOMPzGxcxv56KoMlqmgof6RIIW66esYzm89d4zWewk+iAolrmtkzD4eJoZQS/jbQT0HMRTFMlt5ufT48WEaKt5WUyL86i6UgCwKtuINdyqi3e/CeUEtxdU+n9oPvpn2im8+vth8dzg1JughlcSsJAJCKohHSamTpSaOhd4DWOW9DOnvGpjl/KBBodTAsg/6073UEr2mE2B8Qbjir3Nt7hwVKJD8iED8YVsMp3SdAfdcyg7naR94n1sZdcvi/PTx7/3K3WY7g==')
2019-10-25 02:59:28,779 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:59:30,787 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:00,818 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:02,827 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:32,857 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:34,865 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:04,895 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 03:01:04,988 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 03:01:05,057 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:07,065 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:37,095 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:39,183 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 03:01:39,218 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-07a689ee5610c8c26 not found in the database.
2019-10-25 03:01:39,218 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEByn1myDFouo6xqD1TvU+X7fZdqAVZhrqalZ47BunjAEM4egT7VKB1bGmRvyhzge+1CdZTJUyL6iFp5e/2HeAqCmeObPsaxQkFIUr7IdoplLIEqhuufuCdo/k2Z2BwJlW5naxLgrkHdPXqXl0t/xx06fN3lEnsiC3e1mSwxRyPqk1vxtFGInr8zMLk4Y8FSok91AYXmfQ+sBwcL4xASfBoz9AU9tqqhQA2KzHZltOA891GAi/HIp+lAvYvqWqiG9g03m7iAMzNEtq4beBeqhb4jkTAi8MuziLh/7ggezcLH2H6C9W4En/pEKK98zPqQwKdBHrP4anvelEBas9AvsoBKkTotnbH1bdIrljIe9sJmZLXZTGeWh/26b1AgITyJ5W5anZtPoh4t/t2L+Q5P9yH4Y3n2pLtQxwXVzjrCdJ0txy8KcjglI3vSmsznkg3iVqV7N+dXY5RhXhrA+k+csEWkA==')
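The repeated "Instance ... not found in the database" errors suggest sqswatcher is receiving termination events for nodes that never registered with the master, i.e. the instances are being killed before bootstrap completes. Below is a minimal sketch for grabbing the node-side logs before the Auto Scaling group recycles an instance, assuming the standard ParallelCluster 2.x log locations; the ASG name is a placeholder to look up in the EC2 console:

# 1. Temporarily stop the ASG from replacing unhealthy instances:
aws autoscaling suspend-processes \
    --auto-scaling-group-name <compute-fleet-asg> \
    --scaling-processes HealthCheck ReplaceUnhealthy

# 2. SSH to the stuck compute node (via the master) and inspect the
#    bootstrap and daemon logs:
tail -n 100 /var/log/cfn-init.log /var/log/cloud-init-output.log /var/log/nodewatcher

# 3. Re-enable the suspended processes when done:
aws autoscaling resume-processes \
    --auto-scaling-group-name <compute-fleet-asg> \
    --scaling-processes HealthCheck ReplaceUnhealthy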
Here is the config file:
[aws]
aws_access_key_id = ###
aws_secret_access_key = ###
#aws_region_name = us-east-1
[cluster default]
key_name = fire
master_instance_type = c5.9xlarge
compute_instance_type = c5n.18xlarge
base_os = centos7
#cluster_type = spot
spot_price = 5
initial_queue_size = 0
maintain_initial_size = true
max_queue_size = 6
vpc_settings = poc_vpn
tags = {"user" : "meredithk"}
fsx_settings = custom_fs
ebs_settings = shared
efs_settings = customefs
placement_group = DYNAMIC
enable_efa = compute
# centos7
#post_install = s3://postinstallfmg/parallelcluster-postinstall-centos7-v1.sh
# alinux
#post_install = s3://postinstallfmg/parallelcluster-postinstall-v1.sh
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
master_root_volume_size = 50
[vpc poc_vpn]
vpc_id = vpc-b53f65d1
use_public_ips = false
# useast1a
#master_subnet_id = subnet-a6a304fe
# useast1b
master_subnet_id = subnet-0f2f6a50145d03c60
# additional_sg necessary for efs mounting
additional_sg = sg-0b39ee73
[global]
sanity_check = true
update_check = true
cluster_template = default
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[fsx custom_fs]
shared_dir = /fsx
storage_capacity = 3600
imported_file_chunk_size = 1024
import_path = s3://fmglobal-virtual-fire-scenarios
[ebs shared]
shared_dir = /shared
volume_size = 2000
[efs customefs]
shared_dir = /efs
efs_fs_id = fs-4b70dd00
I’ve attached a screenshot of the autoscaling group from the AWS console.
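For reference, the root cause here appears to have been a VPC with more than one CIDR block (see the maintainer comment below). A quick check using the vpc_id from the config above (more than one entry in the output means the VPC has secondary CIDR blocks, which aws-parallelcluster only supports from v2.5.1 onward):

aws ec2 describe-vpcs \
    --vpc-ids vpc-b53f65d1 \
    --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock'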

VPC with multiple CIDR blocks is now supported as part of v2.5.1: https://github.com/aws/aws-parallelcluster/releases/tag/v2.5.1
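If you hit the same symptom on an older release, a sketch of the upgrade path (the cluster name "mycluster" is a placeholder; clusters are built with the CLI version in use, so the cluster has to be recreated to pick up the fix):

pip install --upgrade "aws-parallelcluster>=2.5.1"
pcluster delete mycluster
pcluster create mycluster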
This worked! Just FYI, on CentOS 7 the restart command is:
sudo systemctl restart nfs
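(On other distributions the unit name differs; as an unverified assumption, it is typically nfs-server on newer RHEL-family systems and nfs-kernel-server on Ubuntu. Running systemctl list-units '*nfs*' will show what is available locally.)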
Could we keep this issue open so that I get notified when the bug is officially fixed?