Kubernetes Executor does not delete pods stuck at creating because of volume mount errors

Apache Airflow version: 2.0.1 (Git version: .release:2.0.1+beb8af5ac6c438c29e2c186145115fb1334a3735)

Kubernetes version (output of kubectl version): 1.17.17-gke.2800

Environment:

  • Cloud provider or hardware configuration: GKE (Kubernetes)
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster) - docker image python:3.8-slim-buster
  • Kernel (e.g. uname -a): Linux bd0d5605654a 4.15.0-140-generic #144-Ubuntu SMP Fri Mar 19 14:12:35 UTC 2021 x86_64 GNU/Linux
  • Install tools: pip
  • Others: Python 3.8, Kubernetes Executor, Docker

What happened: The pod template referenced a volume that no longer existed (it had existed before but was deleted), so the worker pods could never start. The corresponding task in Airflow was stuck in “queued”. Even after the task was cleared, the pods stayed stuck in ContainerCreating, and it seems they have to be deleted manually.

The pods are stuck with the following errors:

Unable to attach or mount volumes: unmounted volumes=[secret-volume], unattached volumes=[google-key airflow-logs secret-volume]: timed out waiting for the condition
MountVolume.SetUp failed for volume "secret-volume" : secret "airflow-secret-14610" not found
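
When many pods are stuck like this, it can help to pull the missing Secret out of the events automatically. The sketch below is a minimal illustration that parses messages in the exact format shown above; it assumes that kubelet wording stays stable (it is log output, not a documented API), and the function name missing_secrets is hypothetical.

```python
import re

# Matches kubelet messages such as:
#   MountVolume.SetUp failed for volume "secret-volume" : secret "airflow-secret-14610" not found
# Assumption: the message wording matches the events quoted in this issue.
_MISSING_SECRET = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)"'
    r'\s*:\s*secret "(?P<secret>[^"]+)" not found'
)


def missing_secrets(event_messages):
    """Return unique (volume, secret) pairs for 'secret not found' mount failures."""
    found = []
    for message in event_messages:
        match = _MISSING_SECRET.search(message)
        if match:
            pair = (match.group("volume"), match.group("secret"))
            if pair not in found:  # keep first-seen order, drop repeats
                found.append(pair)
    return found


if __name__ == "__main__":
    events = [
        "Unable to attach or mount volumes: unmounted volumes=[secret-volume], "
        "unattached volumes=[google-key airflow-logs secret-volume]: "
        "timed out waiting for the condition",
        'MountVolume.SetUp failed for volume "secret-volume" : '
        'secret "airflow-secret-14610" not found',
    ]
    print(missing_secrets(events))
```

The generic "timed out waiting for the condition" message carries no Secret name, so only the second event produces a match.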

Configuration:

AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=True
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_ON_FAILURE=True

What you expected to happen: I would expect Airflow to delete pods that can never be created, at least after the task is cleared.

How to reproduce it: Create a pod template that mounts a volume, then delete that volume without pausing the DAGs that use it.
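
Until Airflow handles this itself, one way to avoid the reproduction above is a pre-flight check that every Secret referenced by the pod template still exists. This is a minimal sketch, assuming the template has already been parsed into a dict with the standard Pod fields (spec.volumes[].secret.secretName and the optional flag); the function name missing_secret_volumes and the example template are hypothetical.

```python
def missing_secret_volumes(pod_template, existing_secrets):
    """List (volume name, secret name) for secret volumes whose Secret is absent.

    pod_template: a Pod manifest parsed into a dict (e.g. from the YAML template).
    existing_secrets: a set of Secret names present in the target namespace.
    """
    missing = []
    for volume in pod_template.get("spec", {}).get("volumes", []):
        secret = volume.get("secret")
        if secret is None:
            continue  # not a secret volume (emptyDir, PVC, ...)
        if secret.get("optional", False):
            continue  # optional secret volumes do not block pod startup
        name = secret.get("secretName")
        if name not in existing_secrets:
            missing.append((volume["name"], name))
    return missing


if __name__ == "__main__":
    # Hypothetical template mirroring the volumes named in this issue.
    template = {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "volumes": [
                {"name": "airflow-logs", "emptyDir": {}},
                {"name": "secret-volume",
                 "secret": {"secretName": "airflow-secret-14610"}},
            ]
        },
    }
    print(missing_secret_volumes(template, existing_secrets={"google-key"}))
```

In practice the set of existing Secret names could come from kubectl get secrets or from the Kubernetes Python client's CoreV1Api.list_namespaced_secret; the check itself is kept free of cluster calls so it can run in CI.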

Anything else we need to know: This happens consistently, and the pods are never deleted.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
jedcunningham commented, Apr 5, 2021

This is related to, but not a duplicate of, that other one. This issue identifies that the poison pill (e.g. “Mark failed”) doesn’t clean up the pending pod.

Basically, the root problem is that once the scheduler creates the worker pod and puts the TI (task instance) in queued, it only listens to k8s events. If the pod will be ‘forever pending’ due to a missing volume, well, it gets stuck forever. We probably want some timeout to handle these. I’ve opened #15218 to address this.
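
The timeout proposed in this comment can be sketched as a periodic check that flags pods which have stayed in the Pending phase too long. A pod stuck in ContainerCreating still reports the pod phase Pending, so it is caught by the same test. This is a minimal illustration, not Airflow's implementation; PodInfo, pods_pending_too_long, and the 5-minute timeout are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PodInfo:
    """Hypothetical minimal view of a worker pod."""
    name: str
    phase: str            # Kubernetes pod phase: "Pending", "Running", ...
    created_at: datetime  # timezone-aware creation timestamp


def pods_pending_too_long(pods, now, timeout):
    """Return names of pods still Pending after `timeout` has elapsed since creation."""
    return [
        pod.name
        for pod in pods
        if pod.phase == "Pending" and now - pod.created_at > timeout
    ]


if __name__ == "__main__":
    now = datetime(2021, 4, 5, 12, 0, tzinfo=timezone.utc)
    pods = [
        PodInfo("stuck-on-secret", "Pending", now - timedelta(minutes=30)),
        PodInfo("just-scheduled", "Pending", now - timedelta(minutes=1)),
        PodInfo("healthy", "Running", now - timedelta(hours=2)),
    ]
    print(pods_pending_too_long(pods, now, timeout=timedelta(minutes=5)))
```

Measuring from creation time keeps the check stateless. A real sweeper would also have to delete the flagged pods (for example with the Kubernetes Python client's CoreV1Api.delete_namespaced_pod) and move the queued task instance out of queued, which is the part the scheduler currently never does.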

1 reaction
kaxil commented, Apr 2, 2021

Oh yeah, looks like it. https://github.com/apache/airflow/pull/14810 should fix it, and it will be in 2.0.2.

Is this the same issue as #14556?
