Scheduler identifies a zombie AFTER a task has concluded successfully --> subsequently sets the task status to failed

See original GitHub issue

Apache Airflow version: 1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2020-07-16T18:50:14Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Environment: Airflow, running on top of Kubernetes (Red Hat OpenShift)

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): From the Airflow containers: fedora:28
  • Kernel (e.g. uname -a): Linux airflow-scheduler-1-xzx5j 3.10.0-1127.18.2.el7.x86_64 #1 SMP Mon Jul 20 22:32:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened:

  • A running Airflow task concludes successfully: [2020-11-17 08:08:30,301] {local_task_job.py:102} INFO - Task exited with return code 0

  • Scheduler logs indicate the following a few seconds later:

[2020-11-17 08:08:41,932] {logging_mixin.py:112} INFO - [2020-11-17 08:08:41,932] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed

[2020-11-17 08:08:48,889] {logging_mixin.py:112} INFO - [2020-11-17 08:08:48,889] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed
  • From the Airflow database, I can extract the latest_heartbeat:
airflow=# SELECT latest_heartbeat FROM job WHERE id = 12;
       latest_heartbeat
-------------------------------
 2020-11-17 08:08:30.292813+00
(1 row)
  • Our airflow.cfg has the following configuration: scheduler_zombie_task_threshold = 300 (see the heartbeat-age check after this list)

  • We have been experiencing this randomly ever since we upgraded to Airflow 1.10.12. Before, we were running 1.10.10 and did not notice such odd behavior.
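
For context, here is a quick sanity check (our own illustration, not part of the original report) comparing the latest_heartbeat from the job table with the timestamp of the first "Marked zombie job" log line. The heartbeat was only about 11 seconds old when the task was flagged, far below the 300-second scheduler_zombie_task_threshold, so heartbeat staleness alone cannot explain the zombie classification:

from datetime import datetime, timezone

# Values copied from the job table and scheduler log above.
latest_heartbeat = datetime(2020, 11, 17, 8, 8, 30, 292813, tzinfo=timezone.utc)
zombie_marked_at = datetime(2020, 11, 17, 8, 8, 41, 932000, tzinfo=timezone.utc)
zombie_threshold_seconds = 300  # scheduler_zombie_task_threshold from airflow.cfg

heartbeat_age = (zombie_marked_at - latest_heartbeat).total_seconds()
print(f"Heartbeat age when marked as zombie: {heartbeat_age:.1f}s")      # ~11.6s
print(f"Exceeds threshold: {heartbeat_age > zombie_threshold_seconds}")  # False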

What you expected to happen: We are surprised that, after a task has concluded successfully, zombie detection identifies it as a zombie and sets it to a failed state. We do not have evidence of what the task state was before it was identified as a zombie.

A plausible scenario (pure speculation):

  1. scheduler identifies the list of running tasks
  2. task finishes successfully, thus process (expectedly) dies
  3. scheduler performs its check on the list of tasks identified in step 1 --> it determines the process has been killed (because of step 2), thus marks it as a zombie and resets its status to failed/up_for_retry? (a sketch of this race follows the list)
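
To make the speculated race concrete, here is a minimal sketch (pure illustration; the names are hypothetical and do not reflect Airflow's actual scheduler internals) of how acting on a stale snapshot of "running" tasks can misclassify a successfully finished task as a zombie:

import time
from dataclasses import dataclass

# Illustrative sketch of the speculated race. The names are hypothetical and
# do NOT reflect Airflow's actual scheduler internals.

@dataclass
class TaskProcess:
    task_id: str
    state: str = "running"   # becomes "success" once the work is done
    alive: bool = True       # whether the supervising process still exists

def snapshot_running_tasks(tasks):
    # Step 1: take a snapshot of what currently looks like a running task.
    return [t for t in tasks if t.state == "running"]

def zombie_check(snapshot):
    # Step 3: the check runs against the (now stale) snapshot. Any task whose
    # process has exited is flagged, even if it exited because it succeeded.
    return [t for t in snapshot if not t.alive]

task = TaskProcess("prepare_reprocessing_srr_mid")
snapshot = snapshot_running_tasks([task])

# Step 2: the task completes and its process (expectedly) goes away.
task.state, task.alive = "success", False
time.sleep(0.1)  # small delay before the check runs

for zombie in zombie_check(snapshot):
    # The stale snapshot still lists the task, so it gets marked
    # failed / up_for_retry despite having succeeded.
    print(f"marking {zombie.task_id} as failed (actual state: {zombie.state})")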

How to reproduce it:

  • The issue occurs sporadically, which makes it challenging to reproduce deterministically.
  • We encountered it with at least the SparkSubmitOperator and the PythonOperator; a minimal DAG to run repeatedly is sketched below.
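
Since the issue is sporadic, the best we can offer is a small DAG to run repeatedly while watching for spurious zombie detections. The sketch below is our own hypothetical example for Airflow 1.10.x (not one of the affected production DAGs); it uses a short PythonOperator task that exits with return code 0, matching the pattern in which we observed the behavior:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path

def short_task():
    # Finishes quickly and exits with return code 0, matching the pattern
    # in which we observed the spurious zombie detection.
    print("work done")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

with DAG(
    dag_id="zombie_repro_attempt",      # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2020, 11, 1),
    schedule_interval="*/5 * * * *",    # run frequently to increase the odds
    catchup=False,
) as dag:
    PythonOperator(
        task_id="short_python_task",
        python_callable=short_task,
    )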

Anything else we need to know: It occurs sporadically. We also saw the following scenario:

  • Task concluded successfully based on the log message and exit code of the task
  • Task was retried afterwards, because the scheduler identified it as a zombie and marked the task as up_for_retry

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented on Jun 24, 2021

Can you please open a new issue, @kono10, and add all the logs etc.? If you see this on 2.1.0, almost certainly some details have changed, and in order to investigate we need "fresh" information (1.10.12 has already reached End of Life and the codebase is vastly different).

0 reactions
ashb commented on Jun 24, 2021

This might already be fixed by https://github.com/apache/airflow/pull/16289, which sadly we didn't include in 2.1.1; it'll be in 2.1.2.
