
Waiting for k8s nodes to reach count

See original GitHub issue

We have a big, busy EKS cluster with nodes joining and leaving many times a day (spot instances failing or being replaced). We try to update each ASG separately using the ASG_NAMES setting. The problem is that eks-rolling-update always checks the whole cluster for the node count, and it frequently fails because that count doesn't match the expected value.

It should only monitor the selected ASG(s) for the expected instance count.

2021-02-10 16:26:57,425 INFO     Current k8s node count is 94
2021-02-10 16:26:57,426 INFO     Current k8s node count is 94
2021-02-10 16:26:57,426 INFO     Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:18,198 INFO     Getting k8s nodes...
2021-02-10 16:27:19,341 INFO     Current k8s node count is 94
2021-02-10 16:27:19,342 INFO     Current k8s node count is 94
2021-02-10 16:27:19,342 INFO     Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:40,119 INFO     Getting k8s nodes...
2021-02-10 16:27:41,470 INFO     Current k8s node count is 94
2021-02-10 16:27:41,471 INFO     Current k8s node count is 94
2021-02-10 16:27:41,471 INFO     Waiting for k8s nodes to reach count 92...
...
2021-02-10 16:28:01,472 INFO     Validation failed for cluster *****. Didn't reach expected node count 92.
2021-02-10 16:28:01,472 INFO     Exiting since ASG healthcheck failed after 2 attempts
2021-02-10 16:28:01,472 ERROR    ASG healthcheck failed
2021-02-10 16:28:01,472 ERROR    *** Rolling update of ASG has failed. Exiting ***
2021-02-10 16:28:01,472 ERROR    AWS Auto Scaling Group processes will need resuming manually
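
For reference, a minimal sketch of the per-ASG check the issue is asking for: count only the nodes whose EC2 instances belong to the selected ASGs, instead of every node in the cluster. This is not how eks-rolling-update currently validates; the function name, and the use of boto3 plus the official Kubernetes Python client, are illustrative assumptions.

import boto3
from kubernetes import client, config

def count_nodes_in_asgs(asg_names):
    # Collect the EC2 instance IDs currently registered in the selected ASGs.
    autoscaling = boto3.client("autoscaling")
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=asg_names)
    instance_ids = {
        instance["InstanceId"]
        for group in groups["AutoScalingGroups"]
        for instance in group["Instances"]
    }

    # Count only the k8s nodes whose providerID (aws:///<az>/<instance-id>)
    # points at one of those instances, ignoring the rest of the cluster.
    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    return sum(
        1 for node in nodes
        if node.spec.provider_id
        and node.spec.provider_id.rsplit("/", 1)[-1] in instance_ids
    )

print(count_nodes_in_asgs(["my-spot-asg"]))  # hypothetical ASG name

A count like this could then be compared against the selected ASGs' expected capacity rather than the cluster-wide total, so resizing activity in other ASGs would not affect the health check.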

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
dat-cao-tien-mox commented, Apr 29, 2021

By default CLUSTER_HEALTH_RETRY=1, so it fails quickly (you can see this in the README). You need to increase this value, e.g. export CLUSTER_HEALTH_RETRY=10. That way it will run the check up to 10 times, which gives it enough time to verify cluster health instead of the single attempt it gets by default.
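
If you drive the tool from a wrapper script, the same settings can be passed through the environment. A minimal sketch with hypothetical ASG and cluster names; the entrypoint and cluster-name flag are assumptions taken from the project README, so check the usage for your installed version:

import os
import subprocess

# Copy the current environment and add the settings discussed in this thread.
env = os.environ.copy()
env["ASG_NAMES"] = "my-spot-asg"      # hypothetical ASG name; limits the update to this group
env["CLUSTER_HEALTH_RETRY"] = "10"    # retry the node-count health check 10 times instead of once

# Invocation is an assumption -- verify the entrypoint and flag in the README.
subprocess.run(["eks_rolling_update.py", "-c", "my-eks-cluster"], env=env, check=True)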

1 reaction
thorro commented, Apr 29, 2021

It helps, but it doesn't necessarily solve the issue, because sometimes another ASG in the cluster is resized in the meantime and the count never reaches the expected value.

I've modified the code so that it simply gives up after a number of tries and makes the changes anyway. That fits our case, since the whole process is monitored by a human anyway.
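
A rough sketch of that kind of workaround, written as a generic retry loop rather than the actual eks-rolling-update code (the function and parameter names are hypothetical): the check is retried a fixed number of times and then the update proceeds with a warning instead of aborting.

import time

def wait_for_node_count(get_count, expected, retries=10, wait_seconds=20):
    # Retry the node-count check, but keep going instead of aborting
    # when the retries run out (the run is monitored by a human anyway).
    for attempt in range(1, retries + 1):
        current = get_count()
        if current == expected:
            return True
        print(f"Attempt {attempt}/{retries}: node count is {current}, expected {expected}")
        time.sleep(wait_seconds)
    print("Node count never matched; giving up on the check and continuing anyway")
    return False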

Read more comments on GitHub >

Top Results From Across the Web

  • How to Debug Kubernetes Pending Pods and Scheduling ...
    Learn how to debug Pending pods that fail to get scheduled due to resource constraints, taints, affinity rules, and other reasons.
  • Nodes - Kubernetes
    In most cases, the node controller limits the eviction rate to --node-eviction-rate (default 0.1) per second, meaning it won't evict pods from ...
  • Why do Kubernetes pod stay in pending state? - Stackify
    Your pod remaining in 'waiting' status means it has been scheduled in the worker's node. Yet, the pod can't run on said machine...
  • kernel:unregister_netdevice: waiting for eth0 to become free ...
    This problem occurs when scaling down pods in kubernetes. A reboot of the node is required to rectify. This has been seen after...
  • Automatic Remediation of Kubernetes Nodes
    unregister_netdevice: waiting for lo to become free. Usage count = 1. The issue is further observed with the number of network interfaces on ...
