
DAG is stuck in running state although Kubernetes Pod is terminated and completed for longer running tasks in AKS

See original GitHub issue

Apache Airflow version: 2.0.1

Kubernetes version : 1.19.7

Environment:

  • Cloud provider or hardware configuration: Azure Kubernetes Service (AKS)
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
  • Kernel (e.g. uname -a): Linux airflow-scheduler-db9fd5df6-6475f 5.4.0-1040-azure #42~18.04.1-Ubuntu SMP Mon Feb 8 19:05:32 UTC 2021 x86_64 GNU/Linux

What happened:

I have configured an AKS Kubernetes cluster with an Airflow Scheduler pod and an Airflow Webserver pod. I have written a DAG that runs using the Kubernetes Pod Operator.

When the DAG is triggered, a pod is created and the steps inside the DAG start running. DAGs that take a short amount of time to finish don't cause any issue. For longer-running tasks, however, when the task is done and the pod moves to a terminated/completed state, the Airflow webserver/scheduler does not seem to receive that information. For this reason, the DAG status stays in the running state in the webserver, and I have to manually mark the DAG as a success to move forward.

I observed both the Kubernetes pod logs and the logs visible in the web UI for that task. The logs I can see from the webserver lag behind the logs in the pod.
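Editor's note: a watch connection to the Kubernetes API going silent over an idle TCP connection is consistent with this symptom (Azure load balancers drop idle connections after a few minutes by default), and the thread below reports that enabling TCP keep-alive resolved it. As a minimal sketch of what keep-alive means at the socket level, roughly what Airflow's `enable_tcp_keepalive` setting turns on for its API connections; the timing values here are illustrative assumptions, not Airflow's defaults:

```python
import socket

# Create a TCP socket and enable keep-alive probes so an idle but
# healthy connection is not silently dropped by a middlebox.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Linux, tune how quickly a dead peer is detected (illustrative values;
# keep the idle interval below the load balancer's idle timeout):
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)  # idle secs before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # secs between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6)     # failed probes before drop

print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero when enabled
sock.close()
```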

What you expected to happen:

When the pod is terminated/completed, the Airflow scheduler should receive that information and mark the task as a success.

How to reproduce it:

To mimic a long-running job I wrote a small script that sleeps in a loop.

Here is the script that the container runs -

import logging
import time
from datetime import datetime


def long_task():
    """Simulate one long-running unit of work."""
    time.sleep(300)


if __name__ == '__main__':
    # Emit INFO-level records so progress is visible in the pod logs.
    logging.basicConfig(level=logging.INFO)

    loop_step_count = 12  # 12 x 5 minutes = 1 hour total runtime

    for i in range(loop_step_count):
        logging.info(f'Loop Count {i} Current timestamp {datetime.utcnow()}')
        long_task()

and this is how I defined my Kubernetes Pod Operator.

from airflow import DAG
from datetime import datetime, timedelta
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,  # note: the key is 'depends_on_past', not 'depend_on_past'
    'start_date': datetime(2020, 2, 5),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=1),
}

with DAG('Termination_Issue_AKS_Airflow',
         description='Pipeline for testing airflow aks termination issue',
         default_args=default_args,
         catchup=False) as dag:

    step = KubernetesPodOperator(
        namespace='airflow',
        name="Termination_Issue_Aks_Airflow",
        task_id="Termination_Issue_Aks_Airflow",
        image="shihabcsedu09/termination_issue_aks_airflow:latest",
        image_pull_policy='Always',
        get_logs=True,
        log_events_on_failure=True,
        is_delete_operator_pod=True,
        node_selector={'agentpool': 'airflowtasks'},
        termination_grace_period=60,
        startup_timeout_seconds=900,
    )

    step

The image for this DAG step is on Docker Hub and is public. (Link). The related Dockerfile looks like this.

FROM python:3.7

COPY termination_issue_aks_airflow.py .

CMD [ "python", "./termination_issue_aks_airflow.py" ]
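Editor's note: one possible contributor to the webserver logs lagging behind the pod logs is Python's stdout buffering inside the container. As a sketch (not a confirmed fix), a variant of the Dockerfile above that disables buffering via `PYTHONUNBUFFERED`, a standard CPython environment variable:

```dockerfile
FROM python:3.7

# Flush Python's stdout/stderr immediately so get_logs=True streams
# log lines as they are produced rather than in bursts.
ENV PYTHONUNBUFFERED=1

COPY termination_issue_aks_airflow.py .

CMD [ "python", "./termination_issue_aks_airflow.py" ]
```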

and this is how I am publishing it.

docker build --no-cache \
  -t termination_issue_aks_airflow \
  -f Dockerfile .

docker tag termination_issue_aks_airflow shihabcsedu09/termination_issue_aks_airflow:latest

docker push shihabcsedu09/termination_issue_aks_airflow:latest

Important to know -

In my Kubernetes cluster there are two node pools:

  1. default node pool - In this pool I deployed my Airflow scheduler and Airflow webserver pods.
  2. airflowtasks node pool - In this pool the steps of my DAGs run. You can see I used node_selector={'agentpool': 'airflowtasks'} to pin the DAG's pods to this pool.

If you are using kubernetes, please attempt to recreate the issue using minikube or kind.


Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
shihabcsedu09 commented, May 15, 2021

@jedcunningham I can now confirm the issue isn’t related to AKS. Enabling TCP keep alive did the trick.

Thanks for all the help. This issue can be closed.
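Editor's note: the fix the reporter confirmed corresponds to Airflow's `enable_tcp_keepalive` option in the `[kubernetes]` configuration section (available in Airflow 2.0.x). A minimal sketch of enabling it through environment variables; the probe timings below are illustrative assumptions you would tune below your load balancer's idle timeout, not recommended values:

```shell
# Enable TCP keep-alive on the scheduler's connections to the
# Kubernetes API so idle watch connections are not silently
# dropped by the Azure load balancer.
export AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE=True

# Illustrative probe timings (seconds).
export AIRFLOW__KUBERNETES__TCP_KEEP_IDLE=120
export AIRFLOW__KUBERNETES__TCP_KEEP_INTVL=30
export AIRFLOW__KUBERNETES__TCP_KEEP_CNT=6

echo "keepalive=${AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE}"
```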

1 reaction
shihabcsedu09 commented, Apr 5, 2021

@turbaszek Do you know if any development work related to this is happening?

Thanks in advance


