Feature request - cordon one node at a time instead of all nodes
See original GitHub issue

With RUN_MODE=1, all old nodes are cordoned at the same time, which makes the AWS ELB mark the old nodes out of service. If the new nodes take some time to come into service, no healthy instances are left for a while, which causes an outage.

We tried cordoning one node at a time and did not see this issue. The downside is that a pod may bounce multiple times, because it can land on an old node that has not yet been cordoned. Some people will be fine with one pod out of multiple replicas bouncing multiple times.
Can we have a RUN_MODE=5 which is the same as RUN_MODE=1, except that it does “cordon 1 node --> drain 1 node --> delete 1 node” one node at a time, instead of “cordon all nodes --> drain 1 node --> delete 1 node”?
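For illustration, a minimal sketch of what the requested per-node flow might look like, driving kubectl from Python. The node names, the kubectl drain flags, and the overall structure are assumptions for this example, not the tool's actual implementation.

```python
import subprocess

# Example node names - purely illustrative, not taken from a real cluster.
OLD_NODES = ["ip-10-0-1-23.ec2.internal", "ip-10-0-2-45.ec2.internal"]

def rotate_one_node(node_name: str) -> None:
    """Cordon, drain, and hand off a single node for termination before
    touching the next one, so the other old nodes stay in service."""
    # Cordon only this node; the remaining old nodes keep serving traffic.
    subprocess.run(["kubectl", "cordon", node_name], check=True)
    # Drain it, evicting pods onto whichever nodes are still schedulable.
    subprocess.run(
        ["kubectl", "drain", node_name,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        check=True,
    )
    # At this point the instance would be terminated through the ASG
    # (see the terminate_instance_in_auto_scaling_group sketch further down).

for node in OLD_NODES:
    rotate_one_node(node)
```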
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 6 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the explanation, it does exactly what we expect 😃 I am closing this request, as the TAINT_NODES=true option does exactly what we want.

It doesn’t do any cordoning - it’s an alternative strategy.
Interacting with LBs isn’t the purpose of cordoning, to my knowledge - cordoning is about preventing scheduling of new workloads. The effect on service-managed LBs is an unintended side effect, which I believe is why it was removed in Kubernetes 1.19.
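As a rough illustration of the taint-based alternative referred to above: applying a NoSchedule taint keeps new pods off an old node without flipping it to Unschedulable the way cordoning does. The taint key and value below are made-up placeholders, not necessarily what the TAINT_NODES option actually applies.

```python
import subprocess

node = "ip-10-0-1-23.ec2.internal"  # example node name

# Apply a NoSchedule taint instead of cordoning. Existing pods keep running;
# new pods simply avoid the node unless they tolerate the taint.
subprocess.run(
    ["kubectl", "taint", "nodes", node, "rolling-update=in-progress:NoSchedule"],
    check=True,
)
```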
The tool uses terminate_instance_in_auto_scaling_group to orchestrate termination of instances in an ASG-aware fashion, and thus ensure your target group deregistration delay is respected, allowing any remaining traffic to drain off the instance before it is actually terminated.

Perhaps you can try it out - I think you will find it does what you expect 😃
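For reference, a minimal sketch of that terminate_instance_in_auto_scaling_group call via boto3. The instance ID is a placeholder, and keeping the desired capacity unchanged is an assumption here so that the ASG launches a replacement node.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Terminate the drained instance through the ASG rather than via plain EC2
# termination, so that (per the comment above) the target group's
# deregistration delay is respected before the instance goes away.
autoscaling.terminate_instance_in_auto_scaling_group(
    InstanceId="i-0123456789abcdef0",       # placeholder instance ID
    ShouldDecrementDesiredCapacity=False,    # let the ASG replace the node
)
```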