Scheduler identifies a zombie AFTER a task has concluded successfully --> subsequently sets the task status to failed

See original GitHub issue

Apache Airflow version: 1.10.12

Kubernetes version (if you are using kubernetes) (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2020-07-16T18:50:14Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

Environment: Airflow, running on top of Kubernetes (Red Hat OpenShift)

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): From the Airflow containers: fedora:28
  • Kernel (e.g. uname -a): Linux airflow-scheduler-1-xzx5j 3.10.0-1127.18.2.el7.x86_64 #1 SMP Mon Jul 20 22:32:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened:

  • A running Airflow task concludes successfully: [2020-11-17 08:08:30,301] {local_task_job.py:102} INFO - Task exited with return code 0

  • Scheduler logs indicate the following a few seconds later:

[2020-11-17 08:08:41,932] {logging_mixin.py:112} INFO - [2020-11-17 08:08:41,932] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed

[2020-11-17 08:08:48,889] {logging_mixin.py:112} INFO - [2020-11-17 08:08:48,889] {dagbag.py:357} INFO - Marked zombie job <TaskInstance: raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00 [failed]> as failed
  • From the Airflow database, I can extract the latest_heartbeat:
airflow=# SELECT latest_heartbeat FROM job WHERE id = 12;
       latest_heartbeat
-------------------------------
 2020-11-17 08:08:30.292813+00
(1 row)
  • Our airflow.cfg has the following configuration: scheduler_zombie_task_threshold = 300 (see the heartbeat-age check after this list)

  • We have been experiencing this randomly ever since we upgraded to Airflow 1.10.12. Before, we were running 1.10.10 and did not notice such odd behavior.
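
For context, here is a quick sanity check (our own illustration, not part of the original report) comparing the latest_heartbeat from the job table with the timestamp of the first "Marked zombie job" log line. The heartbeat was only about 11 seconds old when the task was flagged, far below the 300-second scheduler_zombie_task_threshold, so heartbeat staleness alone cannot explain the zombie classification:

from datetime import datetime, timezone

# Values copied from the job table and scheduler log above.
latest_heartbeat = datetime(2020, 11, 17, 8, 8, 30, 292813, tzinfo=timezone.utc)
zombie_marked_at = datetime(2020, 11, 17, 8, 8, 41, 932000, tzinfo=timezone.utc)
zombie_threshold_seconds = 300  # scheduler_zombie_task_threshold from airflow.cfg

heartbeat_age = (zombie_marked_at - latest_heartbeat).total_seconds()
print(f"Heartbeat age when marked as zombie: {heartbeat_age:.1f}s")      # ~11.6s
print(f"Exceeds threshold: {heartbeat_age > zombie_threshold_seconds}")  # False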

What you expected to happen: We are surprised that, after a task has concluded successfully, zombie detection identifies it as a zombie and sets it to a failed state. We do not have evidence of what the task state was before it was identified as a zombie.

A plausible scenario (pure speculation):

  1. scheduler identifies the list of running tasks
  2. task finishes successfully, thus process (expectedly) dies
  3. scheduler performs its check on the list of tasks identified in step 1 --> it determines the process has been killed (because of step 2), thus marks it as a zombie and resets its status to failed/up_for_retry? (a sketch of this race follows the list)
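
To make the speculated race concrete, here is a minimal sketch (pure illustration; the names are hypothetical and do not reflect Airflow's actual scheduler internals) of how acting on a stale snapshot of "running" tasks can misclassify a successfully finished task as a zombie:

import time
from dataclasses import dataclass

# Illustrative sketch of the speculated race. The names are hypothetical and
# do NOT reflect Airflow's actual scheduler internals.

@dataclass
class TaskProcess:
    task_id: str
    state: str = "running"   # becomes "success" once the work is done
    alive: bool = True       # whether the supervising process still exists

def snapshot_running_tasks(tasks):
    # Step 1: take a snapshot of what currently looks like a running task.
    return [t for t in tasks if t.state == "running"]

def zombie_check(snapshot):
    # Step 3: the check runs against the (now stale) snapshot. Any task whose
    # process has exited is flagged, even if it exited because it succeeded.
    return [t for t in snapshot if not t.alive]

task = TaskProcess("prepare_reprocessing_srr_mid")
snapshot = snapshot_running_tasks([task])

# Step 2: the task completes and its process (expectedly) goes away.
task.state, task.alive = "success", False
time.sleep(0.1)  # small delay before the check runs

for zombie in zombie_check(snapshot):
    # The stale snapshot still lists the task, so it gets marked
    # failed / up_for_retry despite having succeeded.
    print(f"marking {zombie.task_id} as failed (actual state: {zombie.state})")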

How to reproduce it:

  • The issue occurs sporadically, which makes it challenging to reproduce deterministically.
  • We encountered it with at least the SparkSubmitOperator and the PythonOperator; a minimal DAG to run repeatedly is sketched below.
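
Since the issue is sporadic, the best we can offer is a small DAG to run repeatedly while watching for spurious zombie detections. The sketch below is our own hypothetical example for Airflow 1.10.x (not one of the affected production DAGs); it uses a short PythonOperator task that exits with return code 0, matching the pattern in which we observed the behavior:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path

def short_task():
    # Finishes quickly and exits with return code 0, matching the pattern
    # in which we observed the spurious zombie detection.
    print("work done")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

with DAG(
    dag_id="zombie_repro_attempt",      # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2020, 11, 1),
    schedule_interval="*/5 * * * *",    # run frequently to increase the odds
    catchup=False,
) as dag:
    PythonOperator(
        task_id="short_python_task",
        python_callable=short_task,
    )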

Anything else we need to know: It occurs sporadically. We also saw the following scenario:

  • Task concluded successfully based on the log message and exit code of the task
  • Task was retried afterwards, because the scheduler identified it as a zombie and marked the task as up_for_retry

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented on Jun 24, 2021

Can you please open a new issue, @kono10, and add all the logs etc.? If you see this on 2.1.0, almost certainly some details have changed, and in order to investigate we need "fresh" information (1.10.12 has already reached End of Life and the codebase is vastly different).

0 reactions
ashb commented on Jun 24, 2021

This might already be fixed by https://github.com/apache/airflow/pull/16289, which sadly we didn't include in 2.1.1; it'll be in 2.1.2.
