[aws-eks] Stacks with EKS clusters fail tear down due to dangling ELB/ALB holding networking resources
❓ General Issue
The teardown process of a stack runs into race conditions when Kubernetes Operators and Controllers that manage external deployments of resources are involved. As an example, say we have an ALB Ingress Controller deployed through a Helm chart. Following that, we deploy a couple of Ingress resources for which the ALB Ingress Controller will create ALBs. When the stack is removed, the removal of the k8s resources often orphans their cloud-resource equivalents, and the stack fails cleanup because these orphaned resources leave breadcrumbs that block things like SG/VPC/ENI removal. Is there a way to clean up resources properly when using CDK with operators and controllers? I’ve tried separating the K8s resources (helm chart/manifest) into a separate stack so I could manually invoke a sleep between the `cdk destroy` commands, but I ran into trouble even separating the stacks out, due to circular dependencies between the cluster and these K8s overlay resources.
A side note: I’m aware that the ALB Ingress Controller introduced finalizers into the Ingress resources it manages. This means a resource isn’t deleted from the K8s control plane until the Ingress Controller has removed all of its AWS resources, which is good. Maybe the `aws-eks` resource-delete mechanism doesn’t wait?
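That finalizer behavior can be observed directly: while the controller is still cleaning up, the Ingress object’s metadata carries a non-empty `finalizers` list. A small helper, assuming the parsed JSON output of `kubectl get ingress <name> -o json`; the finalizer string shown is only illustrative of the shape, not a guaranteed name:

```python
def has_pending_finalizers(ingress_obj):
    """Return True if the Ingress still carries finalizers, i.e. a
    controller has not yet finished cleaning up the AWS resources
    behind it. `ingress_obj` is the parsed JSON of
    `kubectl get ingress <name> -o json`."""
    metadata = ingress_obj.get("metadata", {})
    return bool(metadata.get("finalizers"))

# Example shapes of the relevant metadata (finalizer name is illustrative):
pending = {"metadata": {"finalizers": ["ingress.k8s.aws/resources"]}}
done = {"metadata": {}}
```

A teardown script could poll this until it returns False before letting the cluster stack go.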
These are some scenarios I’ve found that cause race conditions during stack teardown.
- ALB Ingress Controller
- Removal of the Ingress object causes a race between VPC deletion and ALB deletion, where the ALB is not fully removed by the time the VPC is being removed. I often see ENI breadcrumbs that fail the VPC removal.
- Removal of the ALB Ingress Controller before the Ingress object deletion has finished processing. This shows itself as a fully intact ALB, which causes VPC deletion failures.
- External DNS Controller
- The External DNS Controller might be deleted before it has had time to clean up the DNS entries it created for the resources it was managing. I often see Route53 zones with leftover CNAME/A/TXT records owned by the External DNS Controller.
- Any Operator/Deployment
- When resources are removed all at once, the kubelets have not had enough time to delete the pods supporting those resource sets. But because the K8s control plane has acknowledged that the deletion of those resources is recorded, CloudFormation treats the state as fulfilled and often follows up by deleting the managed worker pools. When this happens, the pods that weren’t deleted leave orphaned ENIs, since the EC2 instances supporting them no longer exist. This is possible when using a CNI that supports AWS-VPC-native networking.
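The scenarios above share one shape: the next deletion starts before the previous one has actually finished in AWS. A generic polling helper can guard against that. This is a sketch of how one might orchestrate it outside of CDK, not a CDK feature; `list_remaining` is an injected callable (for example, one that shells out to `kubectl get pods -o name` in the namespaces being torn down):

```python
import time

def wait_until_gone(list_remaining, timeout=300, interval=5, sleep=time.sleep):
    """Block until `list_remaining()` returns an empty list, or raise
    TimeoutError. Used so that, e.g., node groups are only deleted once
    every pod (and the ENI backing it) is actually gone."""
    waited = 0
    while waited < timeout:
        remaining = list_remaining()
        if not remaining:
            return
        sleep(interval)
        waited += interval
    raise TimeoutError(f"resources still present after {timeout}s: {remaining}")
```

Injecting `list_remaining` and `sleep` keeps the helper testable without a live cluster.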
Issue Analytics
- State:
- Created 3 years ago
- Comments:13 (9 by maintainers)
Top GitHub Comments
Hi @ten-lac - Thanks for reporting this! We have seen this happen a few times now and are considering how to address this.
As far as this workaround goes, we know that circular dependencies can happen quite often when working with multiple stacks. In fact, there is a PR that addresses some of these scenarios. I’m interested specifically in the cases where you encounter these circular dependencies; could you share a few snippets you’ve tried that suffer from this?
Regardless, having to split out to a different stack and manually orchestrate the destruction with `sleep` is definitely not a solution we want to land on. Indeed, when a resource is deleted, we call `kubectl delete`, which is an async operation. The solution is probably going to be using `kubectl wait`. Stay tuned 😃
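The `kubectl wait` idea the maintainer mentions can be sketched as a delete-then-wait pair: `kubectl delete` returns as soon as the deletion is recorded, while `kubectl wait --for=delete` blocks until the API server reports the object gone (which, with finalizers in place, also means the controller has released the AWS resources). The names and the injectable `runner` below are illustrative:

```python
import subprocess

def delete_and_wait(kind, name, namespace="default", timeout="120s",
                    runner=subprocess.run):
    """Issue the async `kubectl delete`, then block with
    `kubectl wait --for=delete` until the object is actually gone."""
    base = ["kubectl", "-n", namespace]
    cmds = [
        base + ["delete", kind, name, "--wait=false"],
        base + ["wait", "--for=delete", f"{kind}/{name}", f"--timeout={timeout}"],
    ]
    for cmd in cmds:
        runner(cmd, check=True)
    return cmds
```

Passing a fake `runner` makes the sequencing testable without a cluster.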
Hi there,
We have an issue on our side which seems related: `cdk destroy` sometimes fails to remove our CloudFormation stacks because it fails to remove a dependent ENI, and the ENI fails to be removed because a SecurityGroup attached to it can’t be removed. Sometimes the same error appears on S3 bucket removal. The funny thing is that the error appears randomly, without any changes to the CDK code. Looks like some kind of race condition.