
Long running tasks being killed with CeleryKubernetesExecutor

Apache Airflow version

2.2.5

What happened

Hi there, this one is a bit of a weird one to reproduce, but I’ll try my best to give as much information as possible.

General Information

First of all, here’s some general information about the setup:

  • Airflow version: v2.2.5
  • Deployed on k8s with the user-community helm chart:
    • 2 scheduler pods
    • 5 worker pods
    • 1 flower pod
    • 2 web pods
    • Using managed Redis from DigitalOcean
  • Executor: CeleryKubernetesExecutor
  • Deployed on DigitalOcean Managed Kubernetes
  • Uses DigitalOcean Managed Postgres
  • I am using the official Airflow Docker images
  • There are no spikes in the DB metrics, the Kubernetes cluster metrics, or anything else that I could find.

These are my relevant env variables:

AIRFLOW__CELERY__WORKER_AUTOSCALE=8,4
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT=64800
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags/repo/
AIRFLOW__CORE__EXECUTOR=CeleryKubernetesExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=1
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=15
AIRFLOW__CORE__PARALLELISM=30
AIRFLOW__CORE__SECURE_MODE=True
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=True
AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM=airflow-v2-logs
AIRFLOW__KUBERNETES__NAMESPACE=airflow
AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE=/opt/airflow/pod_templates/pod_template.yaml
AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=20
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=120
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=False
AIRFLOW__WEBSERVER__RBAC=True
AIRFLOW__WEBSERVER__WORKER_CLASS=gevent
AIRFLOW_HOME=/opt/airflow
AIRFLOW_INSTALLATION_METHOD=apache-airflow
AIRFLOW_PIP_VERSION=21.3.1
AIRFLOW_USER_HOME_DIR=/home/airflow
AIRFLOW_VERSION=2.2.5
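
For what it’s worth, here is roughly how the effective values can be double-checked from inside a worker pod, e.g. to rule out the workers silently falling back to defaults. This is a minimal sketch using Airflow’s config accessor; it only reads back options that are listed above, so the lookups assume those options are actually set (they are, via the env variables).

from airflow.configuration import conf

# Echo the effective settings as the worker process sees them.
print("executor:", conf.get("core", "executor"))
print("parallelism:", conf.getint("core", "parallelism"))
print("worker_autoscale:", conf.get("celery", "worker_autoscale"))
# This lookup only resolves because the option is set via the env variable above.
print(
    "visibility_timeout:",
    conf.getint("celery_broker_transport_options", "visibility_timeout"),
)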

The issue I will be describing here started happening a week ago, right after I moved from KubernetesExecutor to CeleryKubernetesExecutor, so it must have something to do with that switch.

Problem Statement

I have some DAGs with long-running tasks: sensors that take hours to complete, and large SQL queries that take a very long time. Since the sensors often wait for hours, we run them in reschedule mode; the long-running SQL queries unfortunately cannot be executed that way, so those tasks have to stay running for their whole duration.
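
To make the DAG shape concrete, here is a rough sketch of what the affected DAGs look like: a sensor running in reschedule mode, followed by a long-running SQL task that has to hold its slot until the query finishes. The dag_id and retries match what I describe below; everything else (the connection, the SQL, and the choice of SqlSensor/SnowflakeOperator) is a placeholder for illustration, not my real code.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.sensors.sql import SqlSensor

with DAG(
    dag_id="mycompany",
    start_date=datetime(2022, 5, 1),
    schedule_interval="0 1 * * *",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Sensors wait for hours, so they run in reschedule mode and give
    # their worker slot back between pokes.
    wait_for_upstream = SqlSensor(
        task_id="wait_for_upstream",
        conn_id="snowflake_default",  # placeholder connection
        sql="SELECT COUNT(*) FROM upstream_table WHERE ds = '{{ ds }}'",
        mode="reschedule",
        poke_interval=15 * 60,
    )

    # The long-running query has to keep its slot for the whole run;
    # this is the kind of task that gets killed.
    task1 = SnowflakeOperator(
        task_id="task1",
        snowflake_conn_id="snowflake_default",  # placeholder connection
        sql="CREATE OR REPLACE TABLE table1 AS SELECT * FROM users",  # placeholder query
    )

    wait_for_upstream >> task1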

Here’s a sample log showing what a successfully executed query looks like:

[2022-05-26, 05:25:41 ] {cursor.py:705} INFO - query: [SELECT * FROM users WHERE...]
[2022-05-26, 05:57:22 ] {cursor.py:729} INFO - query execution done

Here’s a sample log for a task that started at 2022-05-26, 05:25:37 and actually demonstrates the problem, where the task runs for a longer time:

[2022-05-26, 05:57:22 ] {cursor.py:705} INFO - query: [----- CREATE OR REPLACE TABLE table1 AS WITH users AS ( ...]
[2022-05-26, 06:59:41 ] {taskinstance.py:1033} INFO - Dependencies not met for <TaskInstance: mycompany.task1 scheduled__2022-05-25T01:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
[2022-05-26, 06:59:41 ] {taskinstance.py:1033} INFO - Dependencies not met for <TaskInstance: mycompany.task1 scheduled__2022-05-25T01:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
[2022-05-26, 06:59:41 ] {local_task_job.py:99} INFO - Task is not able to be run

Apparently, when a task runs for a longer time, it gets killed. This is not happening with just a single task instance but with many others, so it is not an operator-specific issue. There are no timeouts and no additional configuration defined on the individual tasks.

Some additional interesting observations:

  • For all those tasks that are killed, I am seeing the same log: Task is not able to be run
  • For these tasks, the retry counts also go above the retries configured for the DAG (see the sketch after this list for how this can be checked).
    • The DAG has 3 retries configured, and there will usually be 4 instances running.
    • This smells like a race condition somewhere, but I’m not sure.
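
For reference, here is roughly how the retry-count observation can be checked against the metadata DB. This is a minimal sketch using Airflow’s ORM session helper; the dag_id is a placeholder and the threshold simply reflects “initial attempt plus configured retries”.

from airflow.models import TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == "mycompany")  # placeholder DAG id
        .all()
    )
    for ti in tis:
        # max_tries mirrors the task's configured retries, so anything beyond
        # max_tries + 1 attempts means the task was re-run more than it should be.
        if ti.try_number > ti.max_tries + 1:
            print(ti.task_id, ti.run_id, ti.try_number, ti.max_tries, ti.state)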

Unfortunately, I don’t have the scheduler logs, but I am on the lookout for them.

As I have mentioned, this only started happening after I switched to CeleryKubernetesExecutor. I’d love to investigate this further. It is causing a lot of pain right now, so I might need to go back to KubernetesExecutor, but I really don’t want to, given that KubernetesExecutor is much slower than CeleryKubernetesExecutor for us because a git clone happens on every task.

Let me know if I can provide additional information. I am trying to find more patterns and details around this so that we can fix the issue, and any leads on what to look at are much appreciated.

What you think should happen instead

The tasks should keep running until they are finished.

How to reproduce

I really don’t know, sorry. I have tried my best to explain the situation above.

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.2.0
apache-airflow-providers-celery==2.1.3
apache-airflow-providers-cncf-kubernetes==3.0.0
apache-airflow-providers-docker==2.5.2
apache-airflow-providers-elasticsearch==2.2.0
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==6.7.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.1.4
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.7.2
apache-airflow-providers-microsoft-mssql==2.0.1
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.5.2
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-snowflake==2.3.0
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.3.0
apache-airflow-providers-tableau==2.1.2

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

karakanb commented, Jun 6, 2022 (2 reactions)

@tanelk it is possible, I don’t know tbh. I have just upgraded to v2.3.2, I’ll observe for some time to see if it is fixed.

tanelk commented, Jun 15, 2022 (0 reactions)

The symptoms sound exactly like #23048, but that should be fixed in 2.3.2. Scheduler logs from around the time of failure would be the next place I would look - it looks like something is re-scheduling the task.
