Scheduler "deadlocks" itself when max_active_runs_per_dag is reached by up_for_retry tasks

Apache Airflow version: 2.0.1

What happened:

Let’s say we have max_active_runs_per_dag = 2 in the config. We then manually trigger, for example, 10 DAG runs of one specific DAG. The DAG contains tasks that should be retried after some interval when they fail.

The problem is that once at least 2 DAG runs contain tasks that have failed, moved to the up_for_retry state, and are waiting to be rescheduled, the scheduler does not reschedule them at all. Its stdout keeps reporting: DAG <dag_name> already has 2 active DAG runs, not queuing any tasks for run <execution_date>. Even DAG runs of other DAGs stop running.

Executor: CeleryExecutor

What you expected to happen:

I expected up_for_retry tasks to be rescheduled once their retry interval had elapsed.

How to reproduce it:

Follow the scenario above: set max_active_runs_per_dag = 2, create a DAG with a PythonOperator whose callable fails, set retry_delay to about 1 minute, manually trigger 2 DAG runs, and verify that the task is not rescheduled after the delay.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 9
  • Comments: 27 (16 by maintainers)

Top GitHub Comments

4 reactions
jecolvin commented, Feb 18, 2021

I think the issue I’m experiencing is related to this.

Apache Airflow version: 2.0.1
Executor: LocalExecutor

What happened: I have max_active set to 4. When I run a backfill for this DAG and 4 sensor tasks are set to up_for_reschedule at the same time, the backfill exits, reporting that all tasks downstream of these sensors are deadlocked.

3 reactions
ephraimbuddy commented, Sep 2, 2021

I have opened a PR related to this issue: https://github.com/apache/airflow/pull/17945

What happens is that DagRun.next_dagruns_to_examine fetches the earliest DAG runs without considering which DAG each run belongs to. For example, take one DAG with execution_date 2020,1,1 and catchup=True, max_active_runs=1, schedule_interval='@daily', and another DAG with execution_date 2021,1,1, also with catchup=True and schedule_interval='@daily'. When you unpause the two DAGs (the one with max_active_runs first), the DAG runs are created, but only one DAG run becomes active because of how DagRun.next_dagruns_to_examine works.

I hope my PR resolves this issue, but I’m worried about its performance. Please take a look: https://github.com/apache/airflow/pull/17945 @uranusjr @kaxil @ash
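The starvation described in this comment can be illustrated with a small, self-contained Python model. It is not Airflow's code: Run, pick_globally_oldest, pick_dag_aware and simulate are hypothetical names. The model assumes that the scheduler examines a fixed-size batch of runs per loop, that started runs never finish (for example, because their tasks sit in up_for_retry), and that dag_b has no per-DAG limit. dag_a mirrors the first DAG in the example (catchup from 2020-01-01, max_active_runs=1) and dag_b the second (catchup from 2021-01-01).

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Run:
    """A simplified DAG run: the DAG it belongs to and its execution date."""
    dag_id: str
    execution_date: date


def make_runs(dag_id, start, count):
    # Mimic catchup=True with a daily schedule: one run per day from `start`.
    return [Run(dag_id, start + timedelta(days=i)) for i in range(count)]


def running_count(active, dag_id):
    return sum(1 for run in active if run.dag_id == dag_id)


def can_start(run, active, max_active):
    # A run may start if its DAG has no limit or is still below its limit.
    limit = max_active.get(run.dag_id)
    return limit is None or running_count(active, run.dag_id) < limit


def pick_globally_oldest(runs, batch_size, active, max_active):
    # Behaviour described in the comment: take the earliest runs overall,
    # without considering which DAG they belong to.
    return sorted(runs, key=lambda r: r.execution_date)[:batch_size]


def pick_dag_aware(runs, batch_size, active, max_active):
    # Hypothetical alternative: skip runs whose DAG is already at its limit,
    # but keep runs that are already active so they are still examined.
    eligible = [r for r in runs if r in active or can_start(r, active, max_active)]
    return sorted(eligible, key=lambda r: r.execution_date)[:batch_size]


def simulate(picker, loops=5, batch_size=4):
    runs = make_runs("dag_a", date(2020, 1, 1), 30) + make_runs("dag_b", date(2021, 1, 1), 30)
    max_active = {"dag_a": 1}  # assumption: dag_b has no per-DAG limit here
    active = set()  # assumption: started runs never finish (e.g. stuck in up_for_retry)
    for _ in range(loops):
        for run in picker(runs, batch_size, active, max_active):
            if run not in active and can_start(run, active, max_active):
                active.add(run)
    return active


stuck = simulate(pick_globally_oldest)
fixed = simulate(pick_dag_aware)
print(sorted(r.dag_id for r in stuck))  # ['dag_a']
print(sorted(r.dag_id for r in fixed))  # ['dag_a', 'dag_b', 'dag_b', 'dag_b']
```

With the oldest-first selection, every batch is filled with dag_a runs that cannot start, so only one run ever becomes active and dag_b is never examined. Filtering out runs whose DAG is already at its limit lets dag_b's runs into the batch. That filter only illustrates the idea; it is not necessarily what the linked PR does, and, as the comment notes, extra per-DAG filtering may carry a performance cost.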


Top Results From Across the Web

Apache Airflow tasks are stuck in a 'up_for_retry' state
I am just constantly met with this message: Task is not ready for retry yet but will be retried automatically. Current date is...

[GitHub] [airflow] antontimenko commented on issue #14205 ...
[GitHub] [airflow] antontimenko commented on issue #14205: Scheduler "deadlocks" itself when max_active_runs_per_dag is reached by up_for_retry tasks.

Possibility of transaction deadlock - IBM
As shown in Figure 1, transaction deadlock means that two (or more) tasks cannot proceed because each task is waiting for the release...

7 Common Errors to Check When Debugging Airflow DAGs
7 Common Errors to Check When Debugging Airflow DAGs. Tasks not running? DAG stuck? Logs nowhere to be found? We've been there.

Tasks — Airflow Documentation
Task Instances¶ · none : The Task has not yet been queued for execution (its dependencies are not yet met) · scheduled :...