Scheduler identifies a zombie AFTER a task has concluded successfully --> subsequently sets the task status to failed
Apache Airflow version: 1.10.12
Kubernetes version (if you are using kubernetes) (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2020-07-16T18:50:14Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Environment: Airflow running on top of Kubernetes (Red Hat OpenShift)
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): From the Airflow containers: fedora:28
- Kernel (e.g. uname -a): Linux airflow-scheduler-1-xzx5j 3.10.0-1127.18.2.el7.x86_64 #1 SMP Mon Jul 20 22:32:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
What happened:
- A running Airflow task concludes successfully:
[2020-11-17 08:08:30,301] {local_task_job.py:102} INFO - Task exited with return code 0
- A few seconds later, the scheduler logs indicate the following:
[2020-11-17 08:08:41,932] {logging_mixin.py:112} INFO - [2020-11-17 08:08:41,932] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed
[2020-11-17 08:08:48,889] {logging_mixin.py:112} INFO - [2020-11-17 08:08:48,889] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed
- From the Airflow database, I can extract the latest_heartbeat:
airflow=# SELECT latest_heartbeat FROM job WHERE id = 12;
latest_heartbeat
-------------------------------
2020-11-17 08:08:30.292813+00
(1 row)
- Our airflow.cfg has the following configuration (a timing sketch based on this threshold follows the list below):
scheduler_zombie_task_threshold = 300
- We have been experiencing this randomly ever since we upgraded to Airflow 1.10.12. Before, we were running 1.10.10 and did not notice this odd behavior.
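For reference, the gap between the last recorded heartbeat (08:08:30) and the moment the scheduler marked the task as a zombie (08:08:41) is only about 11.6 seconds, far below the 300-second threshold, so a stale heartbeat alone does not explain this. A minimal sketch of that arithmetic, using the timestamps from the log and query output above:

from datetime import datetime, timezone

# Timestamps taken from the job table query and the scheduler log above.
latest_heartbeat = datetime(2020, 11, 17, 8, 8, 30, 292813, tzinfo=timezone.utc)
zombie_marked_at = datetime(2020, 11, 17, 8, 8, 41, 932000, tzinfo=timezone.utc)

# scheduler_zombie_task_threshold from our airflow.cfg, in seconds.
zombie_threshold_secs = 300

heartbeat_age = (zombie_marked_at - latest_heartbeat).total_seconds()
print(f"heartbeat age when marked as zombie: {heartbeat_age:.1f}s")   # ~11.6s
print(f"exceeds threshold: {heartbeat_age > zombie_threshold_secs}")  # False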
What you expected to happen: We are surprised that, after a task has concluded successfully, zombie detection identifies it as a zombie and sets it to a failed state. We do not have evidence of what the task state was before it was identified as a zombie.
A plausible scenario (pure speculation; a rough sketch of this logic follows the list below):
1. The scheduler identifies the list of running tasks.
2. The task finishes successfully, so its process (expectedly) dies.
3. The scheduler performs its check on the list of tasks identified in step 1 --> it determines the process has died (because of step 2), thus marks the task as a zombie and re-sets its status to failed / up_for_retry?
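To make the speculation above concrete, the following is a simplified, self-contained sketch of how a check of this shape could misfire on a just-finished task. This is not the actual Airflow 1.10.12 code; the class names (LocalJob, TaskInstance), the helper looks_like_zombie, and the exact condition are all illustrative. The assumption here is that a task instance is treated as a zombie when its row still reads running while its local task job has either stopped heartbeating for longer than the threshold or is no longer running:

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ZOMBIE_THRESHOLD = timedelta(seconds=300)  # scheduler_zombie_task_threshold

@dataclass
class LocalJob:                 # illustrative stand-in, not the Airflow model
    state: str                  # e.g. "running", "success"
    latest_heartbeat: datetime

@dataclass
class TaskInstance:             # illustrative stand-in, not the Airflow model
    task_id: str
    state: str                  # e.g. "running", "success"
    job: LocalJob

def looks_like_zombie(ti: TaskInstance, now: datetime) -> bool:
    # Assumed condition: the task instance is still recorded as running, while
    # its job has stopped running or has not heartbeat within the threshold.
    heartbeat_stale = now - ti.job.latest_heartbeat > ZOMBIE_THRESHOLD
    job_not_running = ti.job.state != "running"
    return ti.state == "running" and (heartbeat_stale or job_not_running)

# Race window: the task process has just exited successfully, the job row has
# already left the "running" state, but the task instance row still says
# "running" at the moment the scheduler snapshots it.
now = datetime(2020, 11, 17, 8, 8, 41, tzinfo=timezone.utc)
ti = TaskInstance(
    task_id="prepare_reprocessing_srr_mid",
    state="running",  # not yet updated to "success"
    job=LocalJob(
        state="success",
        latest_heartbeat=datetime(2020, 11, 17, 8, 8, 30, tzinfo=timezone.utc),
    ),
)
print(looks_like_zombie(ti, now))  # True, even though the heartbeat is ~11s old

Under this reading, the recent heartbeat does not protect the task: the "job no longer running" branch fires as soon as the job finishes, and whether the task ends up failed or up_for_retry would then depend only on its retry settings. Again, this is speculation, not a confirmed analysis of the 1.10.12 code.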
How to reproduce it:
- The issue occurs sporadically and is therefore challenging to reproduce deterministically.
- We encountered it with at least the SparkSubmitOperator and the PythonOperator (a minimal DAG sketch follows below).
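Given the sporadic nature there is no deterministic reproducer; the following is only a minimal sketch of the kind of short-lived task where we observed the problem, assuming Airflow 1.10.x and its classic python_operator import path. The dag_id, schedule and callable are illustrative, and running this will not reliably trigger the issue:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path

def short_task():
    # A quick task that exits successfully almost immediately,
    # similar to the tasks we saw being marked as zombies.
    print("done")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

with DAG(
    dag_id="zombie_repro_sketch",        # illustrative name
    default_args=default_args,
    start_date=datetime(2020, 11, 1),
    schedule_interval="*/5 * * * *",     # run frequently to widen the window
    catchup=False,
) as dag:
    PythonOperator(
        task_id="prepare_reprocessing_srr_mid",
        python_callable=short_task,
    )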
Anything else we need to know: It occurs sporadically. We also saw the following scenario:
- The task concluded successfully based on its log message and exit code.
- The task was retried afterwards because the scheduler identified it as a zombie and marked it as up_for_retry.

Can you please open a new issue @kono10 and add all the logs etc.? If you are seeing this in 2.1.0, almost certainly some details have changed, and in order to investigate we need "fresh" information (1.10.12 has already reached End of Life and the codebase is vastly different).
This might already be fixed by https://github.com/apache/airflow/pull/16289, which sadly we didn't include in 2.1.1; it'll be in 2.1.2.