Scheduler "deadlocks" itself when max_active_runs_per_dag is reached by up_for_retry tasks
Apache Airflow version: 2.0.1
What happened:
Let's say we have max_active_runs_per_dag = 2 in the config. Now we manually trigger, for example, 10 DAG runs for a specific DAG. The DAG contains tasks that should be retried on failure after some interval.
The issue: once at least 2 DAG runs contain tasks that have failed, moved to the up_for_retry state, and are waiting to be rescheduled, the scheduler never reschedules them. In stdout it keeps logging DAG <dag_name> already has 2 active DAG runs, not queuing any tasks for run <execution_date>. Even DAG runs of other DAGs stop running.
Executor: CeleryExecutor
What you expected to happen:
I expected up_for_retry tasks to be rescheduled once their retry interval had elapsed.
How to reproduce it:
Just follow the description above: set max_active_runs_per_dag = 2, create a DAG with a PythonOperator whose callable fails, set retry_delay to something like 1 minute, manually trigger 2 DAG runs, and observe that the tasks are not rescheduled after the delay.
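The gating behaviour described above can be sketched with a small, self-contained simulation. This is an illustrative model, not the real scheduler code: `Run`, `scheduler_pass`, and the run IDs are hypothetical names introduced here to show why runs whose tasks sit in up_for_retry still count as "active" and therefore block their own retries.

```python
from dataclasses import dataclass

MAX_ACTIVE_RUNS_PER_DAG = 2  # mirrors the config value from the report


@dataclass
class Run:
    run_id: str
    state: str = "running"  # a run stays "running" while its task waits in up_for_retry


def scheduler_pass(runs):
    """One simplified scheduler iteration (hypothetical model of the
    pre-fix behaviour): if the DAG already has max_active_runs_per_dag
    active runs, queue nothing for it -- even though the active runs
    themselves are the ones waiting for a retry to be queued."""
    active = [r for r in runs if r.state == "running"]
    if len(active) >= MAX_ACTIVE_RUNS_PER_DAG:
        # corresponds to the log line:
        # "already has 2 active DAG runs, not queuing any tasks"
        return []
    return [r.run_id for r in runs]


runs = [Run("manual__1"), Run("manual__2")]
# Both runs are "active", but their only tasks are stuck in up_for_retry,
# so every scheduler pass queues nothing: a livelock.
print(scheduler_pass(runs))  # -> []
```

Each pass returns an empty queue, so the retries never fire and the runs never finish, which matches the reported symptom.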
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 9
- Comments: 27 (16 by maintainers)
Top GitHub Comments
I think the issue I'm experiencing is related to this.
Apache Airflow version: 2.0.1 Executor: LocalExecutor
What happened: I have max_active set to 4, and when running a backfill for this DAG, if 4 sensor tasks get set to up_for_reschedule at the same time, the backfill exits telling me that all the tasks downstream of these sensors are deadlocked.
I have made a PR related to this issue, see https://github.com/apache/airflow/pull/17945
What happens is that the method DagRun.next_dagruns_to_examine gets the earliest dagruns without considering which DAG each dagrun belongs to. For example: take a DAG with execution_date 2020-01-01 and catchup=True, max_active_runs=1, schedule_interval='@daily', and another DAG with execution_date 2021-01-01 and also catchup=True, schedule_interval='@daily'. When you unpause the two DAGs (the one with max_active_runs first), the dagruns are created, but only one dagrun ever becomes active because of how DagRun.next_dagruns_to_examine works. I'm hopeful my PR resolves this issue, but I'm worried about performance. Please take a look: https://github.com/apache/airflow/pull/17945 @uranusjr @kaxil @ash
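The examination-ordering problem above can be sketched with a tiny stand-in. The `next_dagruns_to_examine` function here is a hypothetical simplification of the real method (which is a SQL query), and the DAG names and counts are made up; it only shows how a global earliest-first sort lets one DAG's catchup backlog crowd out every other DAG:

```python
from datetime import date, timedelta


def next_dagruns_to_examine(dagruns, limit):
    """Hypothetical simplification: pick the globally earliest dagruns,
    ignoring which DAG they belong to (and ignoring max_active_runs)."""
    return sorted(dagruns, key=lambda r: r[1])[:limit]


# dag_a: catchup=True, max_active_runs=1, daily runs since 2020-01-01
dag_a = [("dag_a", date(2020, 1, 1) + timedelta(days=i)) for i in range(400)]
# dag_b: catchup=True, daily runs since 2021-01-01
dag_b = [("dag_b", date(2021, 1, 1) + timedelta(days=i)) for i in range(10)]

picked = next_dagruns_to_examine(dag_a + dag_b, limit=20)
# Every examination slot goes to dag_a's older backlog, even though
# dag_a can only run one of them at a time; dag_b is never looked at.
print({dag for dag, _ in picked})  # -> {'dag_a'}
```

A per-DAG-aware selection (roughly what the linked PR aims for) would leave slots for dag_b instead of repeatedly re-examining runs that max_active_runs=1 prevents from starting anyway.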