
Long running tasks being killed with CeleryKubernetesExecutor

Apache Airflow version

2.2.5

What happened

Hi there, this one is a bit of a weird one to reproduce, but I’ll try my best to give as much information as possible.

General Information

First of all, here’s some general information about the setup:

  • Airflow version: v2.2.5
  • Deployed on k8s with the user-community helm chart:
    • 2 scheduler pods
    • 5 worker pods
    • 1 flower pod
    • 2 web pods
    • Using managed Redis from DigitalOcean
  • Executor: CeleryKubernetesExecutor
  • Deployed on DigitalOcean Managed Kubernetes
  • Uses DigitalOcean Managed Postgres
  • I am using the official Airflow Docker images
  • There are no spikes in the DB metrics, the Kubernetes cluster metrics, or anything else that I could find.

These are my relevant env variables:

AIRFLOW__CELERY__WORKER_AUTOSCALE=8,4
AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT=64800
AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags/repo/
AIRFLOW__CORE__EXECUTOR=CeleryKubernetesExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=False
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=1
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=15
AIRFLOW__CORE__PARALLELISM=30
AIRFLOW__CORE__SECURE_MODE=True
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH=repo
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=True
AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM=airflow-v2-logs
AIRFLOW__KUBERNETES__NAMESPACE=airflow
AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE=/opt/airflow/pod_templates/pod_template.yaml
AIRFLOW__KUBERNETES__WORKER_PODS_CREATION_BATCH_SIZE=20
AIRFLOW__LOGGING__BASE_LOG_FOLDER=/opt/airflow/logs
AIRFLOW__LOGGING__DAG_PROCESSOR_MANAGER_LOG_LOCATION=/opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=/opt/airflow/logs/scheduler
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=120
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=False
AIRFLOW__WEBSERVER__RBAC=True
AIRFLOW__WEBSERVER__WORKER_CLASS=gevent
AIRFLOW_HOME=/opt/airflow
AIRFLOW_INSTALLATION_METHOD=apache-airflow
AIRFLOW_PIP_VERSION=21.3.1
AIRFLOW_USER_HOME_DIR=/home/airflow
AIRFLOW_VERSION=2.2.5
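
For what it’s worth, here is roughly how the effective values can be double-checked from inside a worker pod, e.g. to rule out the workers silently falling back to defaults. This is a minimal sketch using Airflow’s config accessor; it only reads back options that are listed above, so the lookups assume those options are actually set (they are, via the env variables).

from airflow.configuration import conf

# Echo the effective settings as the worker process sees them.
print("executor:", conf.get("core", "executor"))
print("parallelism:", conf.getint("core", "parallelism"))
print("worker_autoscale:", conf.get("celery", "worker_autoscale"))
# This lookup only resolves because the option is set via the env variable above.
print(
    "visibility_timeout:",
    conf.getint("celery_broker_transport_options", "visibility_timeout"),
)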

The issue I will be describing here started happening a week ago, right after I moved from KubernetesExecutor to CeleryKubernetesExecutor, so it must have something to do with that switch.

Problem Statement

I have some DAGs with long-running tasks: sensors that take hours to complete, and large SQL queries that take a very long time. Since the sensors often wait for hours, we run them in reschedule mode; the long-running SQL queries unfortunately cannot be executed that way, so those tasks have to stay running for their whole duration.
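
To make the DAG shape concrete, here is a rough sketch of what the affected DAGs look like: a sensor running in reschedule mode, followed by a long-running SQL task that has to hold its slot until the query finishes. The dag_id and retries match what I describe below; everything else (the connection, the SQL, and the choice of SqlSensor/SnowflakeOperator) is a placeholder for illustration, not my real code.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.sensors.sql import SqlSensor

with DAG(
    dag_id="mycompany",
    start_date=datetime(2022, 5, 1),
    schedule_interval="0 1 * * *",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Sensors wait for hours, so they run in reschedule mode and give
    # their worker slot back between pokes.
    wait_for_upstream = SqlSensor(
        task_id="wait_for_upstream",
        conn_id="snowflake_default",  # placeholder connection
        sql="SELECT COUNT(*) FROM upstream_table WHERE ds = '{{ ds }}'",
        mode="reschedule",
        poke_interval=15 * 60,
    )

    # The long-running query has to keep its slot for the whole run;
    # this is the kind of task that gets killed.
    task1 = SnowflakeOperator(
        task_id="task1",
        snowflake_conn_id="snowflake_default",  # placeholder connection
        sql="CREATE OR REPLACE TABLE table1 AS SELECT * FROM users",  # placeholder query
    )

    wait_for_upstream >> task1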

Here’s a sample log showing what a successfully executed query looks like:

[2022-05-26, 05:25:41 ] {cursor.py:705} INFO - query: [SELECT * FROM users WHERE...]
[2022-05-26, 05:57:22 ] {cursor.py:729} INFO - query execution done

Here’s a sample log for a task that started at 2022-05-26, 05:25:37 and actually demonstrates the problem, where the task runs for a longer time:

[2022-05-26, 05:57:22 ] {cursor.py:705} INFO - query: [----- CREATE OR REPLACE TABLE table1 AS WITH users AS ( ...]
[2022-05-26, 06:59:41 ] {taskinstance.py:1033} INFO - Dependencies not met for <TaskInstance: mycompany.task1 scheduled__2022-05-25T01:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
[2022-05-26, 06:59:41 ] {taskinstance.py:1033} INFO - Dependencies not met for <TaskInstance: mycompany.task1 scheduled__2022-05-25T01:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
[2022-05-26, 06:59:41 ] {local_task_job.py:99} INFO - Task is not able to be run

Apparently, when a task runs for a longer time, it gets killed. This is not happening with just a single task instance but with many others, so it is not an operator-specific issue. There are no timeouts and no additional configuration defined on the individual tasks.

Some additional interesting observations:

  • For all those tasks that are killed, I am seeing the same log: Task is not able to be run
  • For these tasks, the retry counts also go above the retries configured for the DAG (see the sketch after this list for how this can be checked).
    • The DAG has 3 retries configured, and there will usually be 4 instances running.
    • This smells like a race condition somewhere, but I’m not sure.
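
For reference, here is roughly how the retry-count observation can be checked against the metadata DB. This is a minimal sketch using Airflow’s ORM session helper; the dag_id is a placeholder and the threshold simply reflects “initial attempt plus configured retries”.

from airflow.models import TaskInstance
from airflow.utils.session import create_session

with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == "mycompany")  # placeholder DAG id
        .all()
    )
    for ti in tis:
        # max_tries mirrors the task's configured retries, so anything beyond
        # max_tries + 1 attempts means the task was re-run more than it should be.
        if ti.try_number > ti.max_tries + 1:
            print(ti.task_id, ti.run_id, ti.try_number, ti.max_tries, ti.state)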

Unfortunately, I don’t have the scheduler logs, but I am on the lookout for them.

As I have mentioned, this only started happening after I switched to CeleryKubernetesExecutor. I’d love to investigate this further. It is causing a lot of pain right now, so I might need to go back to KubernetesExecutor, but I really don’t want to, given that KubernetesExecutor is much slower than CeleryKubernetesExecutor for us because a git clone happens on every task.

Let me know if I can provide additional information. I am trying to find more patterns and details around this so that we can fix the issue, and any leads on what to look at are much appreciated.

What you think should happen instead

The tasks should keep running until they are finished.

How to reproduce

I really don’t know, sorry. I have tried my best to explain the situation above.

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==3.2.0
apache-airflow-providers-celery==2.1.3
apache-airflow-providers-cncf-kubernetes==3.0.0
apache-airflow-providers-docker==2.5.2
apache-airflow-providers-elasticsearch==2.2.0
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-google==6.7.0
apache-airflow-providers-grpc==2.0.4
apache-airflow-providers-hashicorp==2.1.4
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-microsoft-azure==3.7.2
apache-airflow-providers-microsoft-mssql==2.0.1
apache-airflow-providers-mysql==2.2.3
apache-airflow-providers-odbc==2.0.4
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sendgrid==2.0.4
apache-airflow-providers-sftp==2.5.2
apache-airflow-providers-slack==4.2.3
apache-airflow-providers-snowflake==2.3.0
apache-airflow-providers-sqlite==2.1.3
apache-airflow-providers-ssh==2.3.0
apache-airflow-providers-tableau==2.1.2

Deployment

Other 3rd-party Helm chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

karakanb commented, Jun 6, 2022 (2 reactions)

@tanelk it is possible, I don’t know tbh. I have just upgraded to v2.3.2, I’ll observe for some time to see if it is fixed.

tanelk commented, Jun 15, 2022 (0 reactions)

The symptoms sound exactly like #23048, but that should be fixed in 2.3.2. Scheduler logs from around the time of failure would be the next place I would look - it looks like something is re-scheduling the task.
