Kubernetes Executor does not delete pods stuck at creating because of volume mount errors
Apache Airflow version: 2.0.1 (Git Version: .release:2.0.1+beb8af5ac6c438c29e2c186145115fb1334a3735)
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.17.17-gke.2800
Environment:
- Cloud provider or hardware configuration: GKE (Kubernetes)
- OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster) - docker image python:3.8-slim-buster
- Kernel (e.g. uname -a): Linux bd0d5605654a 4.15.0-140-generic #144-Ubuntu SMP Fri Mar 19 14:12:35 UTC 2021 x86_64 GNU/Linux
- Install tools: pip
- Others: Python 3.8, Kubernetes Executor, Docker
What happened: The pod template referenced a volume that no longer existed (it had existed before but was deleted), so the pod could not start. The task in Airflow was also stuck at “queued”. Even after clearing the task, these pods stayed stuck in container creating, and it seems they have to be deleted manually.
Pods are stuck with
Unable to attach or mount volumes: unmounted volumes=[secret-volume], unattached volumes=[google-key airflow-logs secret-volume]: timed out waiting for the condition
MountVolume.SetUp failed for volume "secret-volume" : secret "airflow-secret-14610" not found
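For anyone triaging similar pods, the two messages above can be recognised programmatically. The following is only a rough sketch: the regular expressions are derived solely from the two lines quoted in this report, the helper name `describe_mount_failure` is made up for illustration, and the message wording is not a stable interface, so it may differ on other Kubernetes versions.

```python
import re
from typing import Optional

# Patterns based only on the two messages quoted in this issue.
# Assumption: the wording is not guaranteed to be stable across versions.
_MISSING_SECRET = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)" : '
    r'secret "(?P<secret>[^"]+)" not found'
)
_MOUNT_TIMEOUT = re.compile(
    r"Unable to attach or mount volumes: "
    r"unmounted volumes=\[(?P<volumes>[^\]]*)\]"
)


def describe_mount_failure(message: str) -> Optional[dict]:
    """Return details about a volume mount failure, or None if no match."""
    match = _MISSING_SECRET.search(message)
    if match:
        return {
            "kind": "missing_secret",
            "volume": match.group("volume"),
            "secret": match.group("secret"),
        }
    match = _MOUNT_TIMEOUT.search(message)
    if match:
        # Volume names inside the brackets are separated by spaces.
        return {
            "kind": "mount_timeout",
            "volumes": match.group("volumes").split(),
        }
    return None
```

A `missing_secret` result points at a pod that can never start until the secret is recreated, which is the situation described in this issue.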
Configuration:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=True
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=True
What you expected to happen: I would expect Airflow to delete pods that are not possible to be created, at least after clearing the task.
How to reproduce it: Create a pod template with a volume and later delete that volume without pausing DAGs
Anything else we need to know: It happens all the time and pods are not being deleted.
Top GitHub Comments
This is related to, but not a duplicate of that other one. This issue identifies that the poison pill (e.g. “Mark failed”) doesn’t clean up the pending pod.
Basically, the root problem is that once the scheduler creates the worker pod and sticks the TI in queued, it only listens to k8s events. If the pod will be ‘forever pending’ due to missing volume, well, it gets stuck forever. We probably want some timeout to handle these. I’ve opened #15218 to address this.
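To make the timeout idea concrete, here is a minimal sketch of an external cleanup, not the change proposed in #15218 and not how Airflow implements it. It assumes the official `kubernetes` Python client is installed and a kubeconfig is available; `PodInfo`, `find_stuck_pods` and `delete_stuck_pods` are hypothetical names. Note that a pod stuck in container creating still reports the `Pending` phase.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class PodInfo:
    """Hypothetical, minimal view of a pod used for the timeout check."""
    name: str
    phase: str
    created_at: datetime  # timezone-aware creation timestamp


def find_stuck_pods(
    pods: Iterable[PodInfo],
    timeout: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return names of pods that have been Pending for longer than timeout."""
    now = now or datetime.now(timezone.utc)
    return [
        pod.name
        for pod in pods
        if pod.phase == "Pending" and now - pod.created_at > timeout
    ]


def delete_stuck_pods(namespace: str, timeout: timedelta) -> List[str]:
    """Delete Pending pods older than timeout in the given namespace.

    Caution: this targets every Pending pod in the namespace. A real
    cleanup would likely narrow the list with a label selector.
    """
    # Imported lazily so the pure check above works without the client.
    from kubernetes import client, config

    config.load_kube_config()  # assumption: running with a local kubeconfig
    api = client.CoreV1Api()
    listed = api.list_namespaced_pod(
        namespace, field_selector="status.phase=Pending"
    )
    pods = [
        PodInfo(p.metadata.name, p.status.phase, p.metadata.creation_timestamp)
        for p in listed.items
    ]
    stuck = find_stuck_pods(pods, timeout)
    for name in stuck:
        api.delete_namespaced_pod(name, namespace)
    return stuck
```

For example, `delete_stuck_pods("airflow", timedelta(minutes=10))` would remove pods that have been pending for more than ten minutes, where the namespace name and the ten-minute limit are placeholders. Keeping the age check in a pure function makes the rule easy to test without a cluster.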
Oh yea looks like it, https://github.com/apache/airflow/pull/14810 should fix it, which will be in 2.0.2.