
KubernetesJobWatcher does not delete worker pods

See original GitHub issue

Apache Airflow version: 2.0.0 and 2.0.1

Kubernetes version (kubectl version): 1.18.4 (AKS)

Environment:

  • Cloud provider or hardware configuration: Azure Cloud
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (Buster)
  • Kernel (e.g. uname -a): Linux airflow-scheduler-5cf464667c-7zd6j 5.4.0-1040-azure #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021 x86_64 GNU/Linux
  • Others: Image apache/airflow:2.0.1-python3.8

What happened:

KubernetesJobWatcher does not delete worker pods after they reach status.phase=Succeeded. This only happens after roughly 30 minutes of complete inactivity in the Kubernetes cluster.

What you expected to happen:

The KubernetesJobWatcher should delete worker pods whenever they complete successfully, as my config states (I verified this with airflow config list):

    [kubernetes]
    pod_template_file = /opt/airflow/pod_template_file.yaml
    worker_container_repository = apache/airflow
    worker_container_tag = 2.0.1-python3.8
    delete_worker_pods = True
    delete_worker_pods_on_failure = False
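
For completeness, one way to double-check the effective values from inside the scheduler container is to read them through Airflow's own config API; a minimal sketch, assuming Airflow 2.x (the section and option names are the ones shown above):

```python
# Hedged sketch: print the effective KubernetesExecutor settings as Airflow sees them.
# Run inside the scheduler container; assumes Airflow 2.x.
from airflow.configuration import conf

print(conf.get("kubernetes", "pod_template_file"))
print(conf.getboolean("kubernetes", "delete_worker_pods"))             # expect True
print(conf.getboolean("kubernetes", "delete_worker_pods_on_failure"))  # expect False
```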

The executor tries over and over again to adopt completed pods.

This succeeds. However, the pods are not cleaned up by the KubernetesJobWatcher, and no logging from the watcher appears. (I would expect logging from this line.)

After some digging, I think the watch.stream() call (from kubernetes import client, watch) invoked in https://github.com/apache/airflow/blob/v2-0-stable/airflow/executors/kubernetes_executor.py#L143 expires after a long period of complete inactivity. This is also explicitly mentioned in the docstring of kubernetes.watch.Watch.stream, which was added in this commit after version 11.0.0.
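
To make the failure mode concrete, here is a minimal sketch of the watch pattern involved (not Airflow's exact code; it assumes the kubernetes Python client and a reachable cluster, and the namespace name is illustrative):

```python
# Minimal sketch of the watch loop, assuming the `kubernetes` Python client.
from kubernetes import client, config, watch

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

watcher = watch.Watch()
# Without timeout_seconds, this generator blocks on the underlying HTTP response.
# If an idle TCP connection is silently dropped by the API server or a load
# balancer, no event and no exception ever arrives, and the loop hangs forever.
for event in watcher.stream(v1.list_namespaced_pod, "airflow"):
    pod = event["object"]
    print(f"Event: {pod.metadata.name} had an event of type {event['type']}")
```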

However, my Airflow installation uses the constraints file, which pins the previous version of the Kubernetes client (11.0.0) and therefore contains the older watcher.stream implementation.

It seems that Airflow is designed to recover from this by resetting the resource version, but for some reason that recovery does not kick in. (I'm currently investigating why.)
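
For reference, the recovery mechanism referred to looks roughly like the following; this is a simplified, hedged sketch building on the snippet above, not Airflow's actual implementation:

```python
# Simplified sketch of resource-version recovery (illustrative, not Airflow's code).
resource_version = "0"
while True:
    try:
        for event in watcher.stream(
            v1.list_namespaced_pod,
            "airflow",  # namespace, illustrative
            resource_version=resource_version,
        ):
            if event["type"] == "ERROR":
                # Typically HTTP 410 Gone: the stored resourceVersion is too
                # old, so reset it and re-list from scratch.
                resource_version = "0"
                break
            resource_version = event["object"].metadata.resource_version
    except Exception:
        # This recovery only runs if the dead connection actually surfaces an
        # error; the bug described here is the case where nothing is raised.
        continue
```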

I think Airflow should be able to recover from this issue automatically. Otherwise I would have to run a dummy task every ~30 minutes just to keep the kubernetes.watch.stream() alive.

How to reproduce it: Run Airflow 2+ in a Kubernetes cluster that has no activity at all for roughly 30 minutes, then start an operator. The worker pod will not be deleted.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

2 reactions
mrpowerus commented, Mar 30, 2021

After debugging the TCP/IP connections, I found that the connection to the kube API was reset after some minutes of complete inactivity on the kubernetes.watch.Watch.stream() call. However, the watcher seems to think the connection is still fine and keeps listening, and no error appears.

This would also explain why no logging of the form Event: ... was showing up after a while.

The fix seems to be to periodically reset the watcher.stream by adding the timeout_seconds argument. This ensures the connection is re-established after at most that interval, so the watcher never sits on a dead socket.

My previous comment about the ProtocolError is not correct, as the KubernetesJobWatcher process did not raise an exception. (I only assumed so because it appeared when I was testing my code locally.)

This patch seems to solve the problem:

```diff
--- kubernetes_executor.py	2021-03-30 13:40:10.957157100 +0200
+++ kubernetes_executor.py	2021-03-30 13:45:13.836000000 +0200
@@ -142,7 +142,7 @@
             list_worker_pods = functools.partial(
                 watcher.stream, kube_client.list_namespaced_pod, self.namespace, **kwargs
             )
-        for event in list_worker_pods():
+        for event in list_worker_pods(timeout_seconds=60):
             task = event['object']
             self.log.info('Event: %s had an event of type %s', task.metadata.name, event['type'])
             if event['type'] == 'ERROR':
```
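
Why this works, as a hedged sketch: with timeout_seconds set, stream() returns normally after at most that interval, so a surrounding retry loop re-creates the HTTP connection regularly instead of trusting a possibly dead socket (the helper and handler names below are illustrative, not Airflow's):

```python
import functools

def run_watch_forever(v1, watcher, namespace="airflow"):
    # Illustrative helper: stream() now terminates after at most 60s, so each
    # pass through the loop establishes a fresh connection to the kube API.
    list_worker_pods = functools.partial(
        watcher.stream, v1.list_namespaced_pod, namespace
    )
    while True:
        for event in list_worker_pods(timeout_seconds=60):
            handle_event(event)  # hypothetical event handler
```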

0 reactions
fashtop3 commented, Aug 14, 2022

I had a similar problem when we upgraded to version 2.x: pods got restarted even after the DAGs ran successfully.

After a long time debugging, I eventually resolved it by overriding the pod template and specifying it in the airflow.cfg file.

```
[kubernetes]
…
pod_template_file = {{ .Values.airflow.home }}/pod_template.yaml
…
```

```yaml
# pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dummy-name
spec:
  serviceAccountName: default
  restartPolicy: Never
  containers:
    - name: base
      image: dummy_image
      imagePullPolicy: IfNotPresent
      ports: []
      command: []
```
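
One quick, hedged sanity check before pointing airflow.cfg at the template: restartPolicy: Never is the key line here, since the Kubernetes default (Always) restarts the container after the task process exits, which matches the "pods get restarted after the DAG succeeded" symptom. A minimal sketch, assuming PyYAML is installed and using an illustrative file path:

```python
# Hedged sketch: verify the rendered pod template has the settings that matter.
import yaml

with open("/opt/airflow/pod_template.yaml") as f:  # illustrative path
    pod = yaml.safe_load(f)

# With the default restartPolicy (Always), Kubernetes restarts the container
# after the task finishes, even though the Airflow task itself succeeded.
assert pod["spec"]["restartPolicy"] == "Never"
assert pod["spec"]["containers"][0]["name"] == "base"
```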

