Waiting for k8s nodes to reach count
We have a large, busy EKS cluster where nodes join and leave many times a day (spot instances failing or being replaced). We try to update each ASG separately using the ASG_NAMES setting. The problem is that eks-rolling-update always checks the whole cluster for the node count, and this frequently fails because the cluster-wide count never matches the expected value.
It should monitor only the selected ASG(s) for the expected instance count.
2021-02-10 16:26:57,425 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:18,198 INFO Getting k8s nodes...
2021-02-10 16:27:19,341 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:40,119 INFO Getting k8s nodes...
2021-02-10 16:27:41,470 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Waiting for k8s nodes to reach count 92...
...
2021-02-10 16:28:01,472 INFO Validation failed for cluster *****. Didn't reach expected node count 92.
2021-02-10 16:28:01,472 INFO Exiting since ASG healthcheck failed after 2 attempts
2021-02-10 16:28:01,472 ERROR ASG healthcheck failed
2021-02-10 16:28:01,472 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-10 16:28:01,472 ERROR AWS Auto Scaling Group processes will need resuming manually
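What the issue asks for, in effect, is a health check scoped to the instances of the named ASGs rather than the whole cluster. Below is a minimal sketch of that idea using boto3 and the Kubernetes Python client; the function names (`asg_instance_ids`, `scoped_node_count`) and the overall structure are hypothetical illustrations, not code from eks-rolling-update itself:

```python
import boto3


def asg_instance_ids(asg_names):
    """Return the EC2 instance IDs currently registered in the given ASGs."""
    client = boto3.client("autoscaling")
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=asg_names)
    return {
        inst["InstanceId"]
        for group in resp["AutoScalingGroups"]
        for inst in group["Instances"]
    }


def scoped_node_count(k8s_nodes, asg_names):
    """Count only the nodes whose backing instance belongs to the selected ASGs.

    On AWS a node's spec.provider_id looks like
    'aws:///us-east-1a/i-0123456789abcdef0', so the instance ID is the
    last path segment.
    """
    ids = asg_instance_ids(asg_names)
    return sum(
        1
        for node in k8s_nodes
        if node.spec.provider_id
        and node.spec.provider_id.rsplit("/", 1)[-1] in ids
    )
```

Comparing a count computed this way against the expected size of the selected ASGs would be unaffected by other ASGs in the cluster scaling up or down during the rollout.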
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
By default CLUSTER_HEALTH_RETRY=1, so it fails quickly (this is documented in the README). You need to increase the value, e.g. export CLUSTER_HEALTH_RETRY=10. The tool will then check up to 10 times, giving the cluster enough time to pass the health check instead of the single attempt it gets by default.
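For reference, here is roughly what a retry loop driven by those settings looks like. This is only a sketch, not the tool's actual implementation: `get_count` is a hypothetical stand-in for the tool's own node query, and CLUSTER_HEALTH_WAIT is the companion README setting for the pause between attempts (assumed default 90 seconds):

```python
import os
import time


def wait_for_node_count(get_count, expected):
    """Retry the node-count check up to CLUSTER_HEALTH_RETRY times.

    get_count is a callable returning the current node count;
    CLUSTER_HEALTH_WAIT is the pause between attempts in seconds.
    """
    retries = int(os.getenv("CLUSTER_HEALTH_RETRY", "1"))
    wait = int(os.getenv("CLUSTER_HEALTH_WAIT", "90"))
    for attempt in range(1, retries + 1):
        current = get_count()
        if current == expected:
            return True
        print(f"Attempt {attempt}/{retries}: node count {current}, want {expected}")
        time.sleep(wait)
    return False
```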
It helps, but it doesn't necessarily solve the issue: sometimes another ASG in the cluster is resized in the meantime, so the expected count is never reached.
I've modified the code so that it simply gives up after a number of attempts and applies the changes anyway. That fits our case, since the whole process is monitored by a human anyway.
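That workaround amounts to downgrading the hard failure to a warning once the retries are exhausted. A sketch of the idea, reusing the hypothetical `wait_for_node_count` from the previous snippet (the commenter's actual patch may differ):

```python
import logging

logger = logging.getLogger(__name__)


def check_or_proceed(get_count, expected):
    """Run the node-count check, but warn and continue instead of aborting."""
    if not wait_for_node_count(get_count, expected):
        # The default behavior raises here and suspends the rollout; since
        # a human is monitoring the whole process, warn and carry on instead.
        logger.warning(
            "Never reached node count %s; proceeding with the update anyway",
            expected,
        )
```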