compute instances fail health check in endless loop

Environment: aws-parallelcluster-2.4.1, centos7, sge, master: c5.9xlarge, compute: c5n.18xlarge

The compute nodes never become live because they continually fail the health check on start-up and are terminated. Here’s the output from /var/log/sqswatcher on the master node:

2019-10-25 02:56:16,247 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:18,259 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:56:48,289 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:56:50,324 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:20,354 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:22,363 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:57:52,393 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 02:57:52,499 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 02:57:52,564 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:57:54,574 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:24,604 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:26,613 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 02:58:56,643 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:58:58,703 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 02:58:58,739 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-0012f6c570f00bcd9 not found in the database.
2019-10-25 02:58:58,739 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEBc+SA48ZuhUmx1xVpZQipj8SM8xXztziZmkdk1lQjLwNB+F2rGHWrbG2ZKDtvMG4VsI1ek2PgC9fcw/aY6+Q/Tt+0jEMzYZhrDtwqycJpKYdFJzWjY5/blVSNbuc1ZQTqi7QhxlKkySEZ/igX4uFTGgVoZxGw6SFrDzq9IWjn7yJ54ZyJN8rPIthi57QmkU5inlSwPV5pcj6oAftOMPzGxcxv56KoMlqmgof6RIIW66esYzm89d4zWewk+iAolrmtkzD4eJoZQS/jbQT0HMRTFMlt5ufT48WEaKt5WUyL86i6UgCwKtuINdyqi3e/CeUEtxdU+n9oPvpn2im8+vth8dzg1JughlcSsJAJCKohHSamTpSaOhd4DWOW9DOnvGpjl/KBBodTAsg/6073UEr2mE2B8Qbjir3Nt7hwVKJD8iED8YVsMp3SdAfdcyg7naR94n1sZdcvi/PTx7/3K3WY7g==')
2019-10-25 02:59:28,779 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 02:59:30,787 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:00,818 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:02,827 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:00:32,857 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:00:34,865 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:04,895 INFO [sqswatcher:_poll_queue] Refreshing cluster properties
2019-10-25 03:01:04,988 INFO [utils:get_asg_settings] min/desired/max 0/1/6
2019-10-25 03:01:05,057 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:07,065 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 0 messages from SQS queue
2019-10-25 03:01:37,095 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieving messages from SQS queue
2019-10-25 03:01:39,183 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue
2019-10-25 03:01:39,218 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-07a689ee5610c8c26 not found in the database.
2019-10-25 03:01:39,218 WARNING [sqswatcher:_parse_sqs_messages] Discarding message sqs.Message(queue_url='https://queue.amazonaws.com/684353139040/parallelcluster-meredithk-test-efa-nohyper1-intel3-SQS-1Q2QW8X745LEM', receipt_handle='AQEByn1myDFouo6xqD1TvU+X7fZdqAVZhrqalZ47BunjAEM4egT7VKB1bGmRvyhzge+1CdZTJUyL6iFp5e/2HeAqCmeObPsaxQkFIUr7IdoplLIEqhuufuCdo/k2Z2BwJlW5naxLgrkHdPXqXl0t/xx06fN3lEnsiC3e1mSwxRyPqk1vxtFGInr8zMLk4Y8FSok91AYXmfQ+sBwcL4xASfBoz9AU9tqqhQA2KzHZltOA891GAi/HIp+lAvYvqWqiG9g03m7iAMzNEtq4beBeqhb4jkTAi8MuziLh/7ggezcLH2H6C9W4En/pEKK98zPqQwKdBHrP4anvelEBas9AvsoBKkTotnbH1bdIrljIe9sJmZLXZTGeWh/26b1AgITyJ5W5anZtPoh4t/t2L+Q5P9yH4Y3n2pLtQxwXVzjrCdJ0txy8KcjglI3vSmsznkg3iVqV7N+dXY5RhXhrA+k+csEWkA==')
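
The pattern in the log above (a terminate event for an instance that sqswatcher has no record of) can be pulled out of a longer log with a short script. This is a minimal sketch, assuming the log lines keep exactly the format shown in the excerpt; the function name `find_unregistered_terminations` is made up for illustration and is not part of ParallelCluster.

```python
import re

# Matches the ERROR lines that sqswatcher writes in the excerpt above.
# The pattern is derived from those lines only; other versions may log differently.
TERMINATE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ ERROR "
    r"\[sqswatcher:_process_instance_terminate_event\] "
    r"Instance (?P<instance_id>i-[0-9a-f]+) not found in the database\."
)


def find_unregistered_terminations(lines):
    """Return (timestamp, instance_id) pairs for terminate events on unknown instances."""
    hits = []
    for line in lines:
        match = TERMINATE_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("instance_id")))
    return hits


# Three lines copied from the log excerpt above.
sample = [
    "2019-10-25 02:58:58,703 INFO [sqswatcher:_retrieve_all_sqs_messages] Retrieved 1 messages from SQS queue",
    "2019-10-25 02:58:58,739 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-0012f6c570f00bcd9 not found in the database.",
    "2019-10-25 03:01:39,218 ERROR [sqswatcher:_process_instance_terminate_event] Instance i-07a689ee5610c8c26 not found in the database.",
]

events = find_unregistered_terminations(sample)
for timestamp, instance_id in events:
    print(timestamp, instance_id)
```

To scan the real file, pass an open file handle for /var/log/sqswatcher instead of `sample`. A steady stream of distinct instance IDs would be consistent with each compute node being terminated before it ever registered with the master.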

Here is the config file:

[aws]
aws_access_key_id = ###
aws_secret_access_key = ###
#aws_region_name = us-east-1

[cluster default]
key_name = fire
master_instance_type = c5.9xlarge
compute_instance_type = c5n.18xlarge
base_os = centos7
#cluster_type = spot
spot_price = 5
initial_queue_size = 0
maintain_initial_size = true
max_queue_size = 6
vpc_settings = poc_vpn
tags = {"user" : "meredithk"}
fsx_settings = custom_fs
ebs_settings = shared
fs_settings = customefs
placement_group = DYNAMIC
enable_efa = compute
# centos7
#post_install = s3://postinstallfmg/parallelcluster-postinstall-centos7-v1.sh
# alinux
#post_install = s3://postinstallfmg/parallelcluster-postinstall-v1.sh
extra_json = { "cfncluster" : { "cfn_scheduler_slots" : "cores" } }
master_root_volume_size = 50

[vpc poc_vpn]
vpc_id = vpc-b53f65d1
use_public_ips = false
# useast1a
#master_subnet_id = subnet-a6a304fe
# useast1b
master_subnet_id = subnet-0f2f6a50145d03c60
# additional_sg necessary for efs mounting
additional_sg = sg-0b39ee73

[global]
sanity_check = true
update_check = true
cluster_template = default

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[fsx custom_fs]
shared_dir = /fsx
storage_capacity = 3600
imported_file_chunk_size = 1024
import_path = s3://fmglobal-virtual-fire-scenarios

[ebs shared]
shared_dir = /shared
volume_size = 2000

[efs customefs]
shared_dir = /efs
efs_fs_id = fs-4b70dd00
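
The sections in the config above refer to each other by name: [global] names the cluster template, and the cluster section names its [vpc ...] section. The sketch below follows those references with Python's standard `configparser` on a trimmed excerpt of the file. It only shows how the sections link together; it is an assumption that this mirrors how ParallelCluster itself reads the file.

```python
import configparser

# Trimmed excerpt of the config above; only the keys needed here are kept.
CONFIG_EXCERPT = """
[global]
cluster_template = default

[cluster default]
compute_instance_type = c5n.18xlarge
initial_queue_size = 0
max_queue_size = 6
vpc_settings = poc_vpn
enable_efa = compute

[vpc poc_vpn]
vpc_id = vpc-b53f65d1
use_public_ips = false
master_subnet_id = subnet-0f2f6a50145d03c60
"""

# Interpolation is disabled so that values are read back verbatim.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_EXCERPT)

# Follow the references: [global] -> [cluster <name>] -> [vpc <name>].
template = parser["global"]["cluster_template"]
cluster = parser[f"cluster {template}"]
vpc = parser[f"vpc {cluster['vpc_settings']}"]

summary = {
    "compute_instance_type": cluster["compute_instance_type"],
    "max_queue_size": cluster.getint("max_queue_size"),
    "vpc_id": vpc["vpc_id"],
    "master_subnet_id": vpc["master_subnet_id"],
    # None here means the config sets no separate compute subnet.
    "compute_subnet_id": vpc.get("compute_subnet_id"),
}
print(summary)
```

The resolved `vpc_id` is the value to look at when checking the network setup of the VPC the compute nodes launch into.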

I’ve attached a screenshot of the Auto Scaling group from the AWS console:

[Screenshot: Screen Shot 2019-10-24 at 11 04 24 PM]

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 27 (13 by maintainers)

Top GitHub Comments

demartinofra commented, Dec 13, 2019 (1 reaction)

VPC with multiple CIDR blocks is now supported as part of v2.5.1: https://github.com/aws/aws-parallelcluster/releases/tag/v2.5.1
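
The comment above suggests the failure is tied to a VPC that has more than one CIDR block on a release older than v2.5.1. One way to check whether a VPC is in that situation is to inspect the result of the EC2 DescribeVpcs call. The sketch below works on a response dictionary so that it runs without AWS credentials. The helper name and the CIDR values in the sample are hypothetical; only the VPC ID comes from the config in the issue.

```python
def associated_cidr_blocks(describe_vpcs_response, vpc_id):
    """Return the CIDR blocks currently associated with vpc_id.

    describe_vpcs_response is expected to have the shape returned by
    boto3.client("ec2").describe_vpcs(VpcIds=[vpc_id]).
    """
    for vpc in describe_vpcs_response.get("Vpcs", []):
        if vpc.get("VpcId") == vpc_id:
            return [
                assoc["CidrBlock"]
                for assoc in vpc.get("CidrBlockAssociationSet", [])
                if assoc.get("CidrBlockState", {}).get("State") == "associated"
            ]
    return []


# Hypothetical response: the CIDR values are invented for illustration.
sample_response = {
    "Vpcs": [
        {
            "VpcId": "vpc-b53f65d1",
            "CidrBlock": "10.0.0.0/16",
            "CidrBlockAssociationSet": [
                {"CidrBlock": "10.0.0.0/16", "CidrBlockState": {"State": "associated"}},
                {"CidrBlock": "10.1.0.0/16", "CidrBlockState": {"State": "associated"}},
            ],
        }
    ]
}

blocks = associated_cidr_blocks(sample_response, "vpc-b53f65d1")
has_multiple_cidrs = len(blocks) > 1
print(blocks, has_multiple_cidrs)
```

If `has_multiple_cidrs` is true for the cluster's VPC and the cluster runs a release older than v2.5.1, upgrading to v2.5.1 or later is the fix the maintainer's comment points to.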

karlvirgil commented, Dec 5, 2019 (1 reaction)

This worked! Just FYI, on centos7 the restart command is: sudo systemctl restart nfs

Could we keep this issue open so that when the bug is officially fixed I get notified?
