Can we add configurable debug settings to delay pod deletion when a pod is in the `Error` state?
Description
Add configurable debug settings to delay pod deletion when a pod is in the Error state.
Use case / motivation
I am using the apache/airflow:1.10.10 image.
I have deployed Airflow in k8s and want to use the Kubernetes Executor for task execution. If a worker pod enters the Error state, the Airflow scheduler deletes it immediately. Because the pod is gone within a few seconds, we cannot see what happened.
As a workaround, I added time.sleep() in kubernetes_executor.py:896, like this:
def _change_state(self, key, state, pod_id, namespace):
    if state != State.RUNNING:
        if self.kube_config.delete_worker_pods:
            for x in range(120):
                self.log.info(str(x) + ": sleep 1s for...")
                time.sleep(1)
            self.kube_scheduler.delete_pod(pod_id, namespace)
            self.log.info('Deleted pod: %s in namespace %s', str(key), str(namespace))
        try:
            self.running.pop(key)
        except KeyError:
            self.log.debug('Could not find key: %s', str(key))
    self.event_buffer[key] = state
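The hard-coded 120-second loop above could be replaced by a configurable delay. The following is only a minimal sketch of that idea, not an existing Airflow feature: the setting name `AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_DELAY`, the helper names, and the one-hour cap are all hypothetical, and the `"failed"` state string is assumed to be the value the executor reports for a pod in the Error state.

```python
import os
import time

# Hypothetical setting name; Airflow 1.10.10 does not define it.
DELAY_ENV_VAR = "AIRFLOW__KUBERNETES__DELETE_WORKER_PODS_DELAY"
# Hypothetical upper bound so a typo cannot stall the caller indefinitely.
MAX_DELAY_SECONDS = 3600.0


def deletion_delay_seconds(state, environ=os.environ):
    """Return how many seconds to wait before deleting a worker pod."""
    # Only failed pods are worth keeping around for inspection.
    if state != "failed":
        return 0.0
    try:
        delay = float(environ.get(DELAY_ENV_VAR, "0"))
    except ValueError:
        return 0.0
    if not delay > 0:  # rejects zero, negative values, and NaN
        return 0.0
    return min(delay, MAX_DELAY_SECONDS)


def delete_pod_with_delay(delete_pod, pod_id, namespace, state,
                          sleep=time.sleep, environ=os.environ):
    """Wait for the configured delay, then call delete_pod(pod_id, namespace)."""
    delay = deletion_delay_seconds(state, environ)
    if delay > 0:
        sleep(delay)
    delete_pod(pod_id, namespace)
    return delay
```

Inside `_change_state`, the call would stand in for the sleep loop and the `delete_pod` call, for example `delete_pod_with_delay(self.kube_scheduler.delete_pod, pod_id, namespace, state)`. Like the original workaround, this blocks the caller while sleeping, so it is probably only suitable for debugging.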
When I trigger a run manually, I can see the pod enter the Error state shortly afterwards.
➜ ~ kubectl get po
NAME                                                         READY   STATUS    RESTARTS   AGE
airflow-564c84ff46-tn5mg                                     2/2     Running   0          67s
examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a   0/1     Error     0          24s
Watching the pod's log:
➜ ~ kubectl logs -f examplebashoperatorrunme0-76fd68aa96d64e8c93c7c87904f3312a
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 23, in <module>
    import argcomplete
ModuleNotFoundError: No module named 'argcomplete'
It's an error inside the container. With the pod kept alive, it is now easy to debug.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 6
- Comments:9 (2 by maintainers)
Top GitHub Comments
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
Hi gwind, how did you solve the container error?
ModuleNotFoundError: No module named 'argcomplete'
I have the same issue in pods with the Kubernetes executor and the example DAGs.
There is an option to keep (not delete) worker pods: AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"
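The variable name in the last comment follows Airflow's `AIRFLOW__{SECTION}__{KEY}` convention for overriding configuration from the environment, so it should correspond to `delete_worker_pods` in the `[kubernetes]` section of `airflow.cfg`. The sketch below only illustrates that mapping with the standard library. The helper names are hypothetical, and the accepted boolean spellings are an assumption rather than a copy of Airflow's own parser.

```python
def parse_airflow_env_var(name):
    """Split an AIRFLOW__SECTION__KEY variable name into (section, key)."""
    prefix = "AIRFLOW__"
    if not name.startswith(prefix):
        raise ValueError("not an Airflow config variable: %r" % name)
    # The first double underscore after the prefix separates section and key.
    section, sep, key = name[len(prefix):].partition("__")
    if not sep or not section or not key:
        raise ValueError("expected AIRFLOW__SECTION__KEY, got %r" % name)
    return section.lower(), key.lower()


def to_bool(value):
    """Interpret a config string as a boolean (assumed spellings)."""
    normalized = value.strip().lower()
    if normalized in ("true", "t", "1"):
        return True
    if normalized in ("false", "f", "0"):
        return False
    raise ValueError("not a boolean: %r" % value)


section, key = parse_airflow_env_var("AIRFLOW__KUBERNETES__DELETE_WORKER_PODS")
keep_pods = not to_bool("false")
```

With the value `"false"`, `keep_pods` is true, which matches the behaviour in the original `_change_state` code: `delete_pod` is only called when `delete_worker_pods` is enabled, so failed pods stay available for `kubectl logs` until they are removed manually.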