
Clearing tasks for previously finished DAG runs in airflow 2.0 does not lead to scheduling of tasks (when max_active_runs is reached)

See original GitHub issue

Apache Airflow version: 2.0, LocalExecutor

Environment: Docker on Windows 10 with WSL using image apache/airflow:2.0.0-python3.8

What happened:

Situation:

  • There is a DAG, say mydag, with
    • catchup=True
    • max_active_runs=1
  • Let’s say there are two DAG runs, t=0 and t=1
  • The first task of the DAG is a sensor that senses whether the previous DAG run was successful
  • Now, t=0 gets run, tasks are scheduled, and a task in t=0 fails
  • Then, t=1 gets run, and its first task - the sensor - cannot detect a successful previous run, so it keeps sensing
  • Now I clear the failed task in t=0 and expect it to run again, as it did in Airflow 1.x
  • It doesn’t - instead, the scheduler logs the following:
scheduler_1  | [2020-12-31 15:25:32,770] {scheduler_job.py:1667} INFO - DAG mydag already has 1 active runs, not queuing any tasks for run 2020-12-26 05:00:00+00:00 [note: this is t=0]

Thus, t=0 never finishes, t=1 never senses a finished run, and any t=n with n>1 also has no chance of ever succeeding.

One alternative would be to remove the max_active_runs constraint, but that is not feasible: it would create hundreds of DAG runs at once, which is a complete performance killer.
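
For concreteness, here is a minimal sketch of what such a DAG could look like. The task names and the use of ExternalTaskSensor pointed at the same DAG with a one-day execution_delta are my own illustration of the “previous run succeeded” check, not necessarily how the original DAG implements it:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="mydag",
    start_date=datetime(2020, 12, 20),
    schedule_interval="@daily",
    catchup=True,        # backfill every schedule since start_date
    max_active_runs=1,   # only one DAG run may be active at a time
) as dag:
    # Waits until this DAG's previous scheduled run has finished successfully.
    wait_for_previous_run = ExternalTaskSensor(
        task_id="wait_for_previous_run",
        external_dag_id="mydag",
        external_task_id=None,              # wait for the whole DAG run, not a single task
        execution_delta=timedelta(days=1),  # look one schedule interval back
        mode="reschedule",
    )

    do_work = BashOperator(task_id="do_work", bash_command="echo working")

    wait_for_previous_run >> do_work

With a setup like this, a cleared-but-never-rescheduled t=0 blocks the sensor in t=1 indefinitely, which is exactly the deadlock described above.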

What you expected to happen:

As in previous Airflow versions, I would expect the cleared tasks to be scheduled again, which they are not.

Why this happens:

tl;dr Ultimately, this happens because Airflow uses TI (TaskInstance) instead of DR (DagRun) here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509

_do_scheduling() runs _schedule_dag_run() once for each dag_id and passes it the set of active DAG runs as an argument, here: https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1515. The tasks that should be queued are not queued because their DAG runs are missing from that set of active DAG runs, even though those runs are in the running state. The reason is that https://github.com/apache/airflow/blob/v2-0-stable/airflow/jobs/scheduler_job.py#L1499-L1509 builds the set from the TaskInstances of each DAG run and their execution dates instead of from the DagRuns themselves; since the tasks were successful or failed and were then cleared, they are filtered out of that query. Replacing TI with DR in that query should fix this issue without breaking anything that currently works.
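
To make the proposed change concrete, here is a paraphrase of the two query shapes; the names session and dag_ids are placeholders, and this is not the exact scheduler code (see the linked lines for that). The reproduction snippet further below runs both variants side by side.

from airflow import models, settings
from airflow.utils.state import State

TI = models.TaskInstance
DR = models.DagRun
session = settings.Session()
dag_ids = ["mydag"]  # placeholder for the dag_ids the scheduler is currently handling

# Current logic (paraphrased): infer the active runs from unfinished TaskInstances.
# Tasks that were successful or failed and were then cleared end up with state NULL,
# which the NOT IN filter does not match, so such a run drops out of this result.
current_active = session.query(TI.dag_id, TI.execution_date).filter(
    TI.dag_id.in_(dag_ids),
    TI.state.notin_(list(State.finished)),
)

# Proposed logic: ask the DagRun table directly; the cleared run is still "running" there.
proposed_active = session.query(DR.dag_id, DR.execution_date).filter(
    DR.dag_id.in_(dag_ids),
    DR.state == State.RUNNING,
)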

How to reproduce it:

You don’t need the sensor logic I described above to reproduce this behavior. While I didn’t try this exact sequence, the following should reproduce it:

  • Create a DAG mydag with catchup=True and max_active_runs=1
  • Give it just a dummy task or similar, and let it run a couple of times so that you have a couple of successful DAG runs
  • Pause the DAG*
  • Clear a couple of tasks in dag runs that were successful
  • Run this snippet to compare the result of the query based on TI with the one based on DR:
# Compare how the scheduler sees "active DAG runs" when the query is based on
# TaskInstance (current code) versus DagRun (proposed fix).
from airflow import models, settings
from airflow.utils.state import State

TI = models.TaskInstance
DR = models.DagRun
dag_id = "mydag"
session = settings.Session()

result = "\n\nactive DAG runs according to current code logic:"
# Current logic: a run counts as active if it still has unfinished TaskInstances.
# Cleared tasks have state NULL and are dropped by the NOT IN filter.
for data_tuple in session.query(TI.dag_id, TI.execution_date).filter(
    TI.dag_id.in_([dag_id]), TI.state.notin_(list(State.finished))
):
    result += "\n\t" + str(data_tuple)

result += "\n\nactive DAG runs according to my proposed code logic:"
# Proposed logic: a run counts as active if its DagRun is in the running state.
for data_tuple in session.query(DR.dag_id, DR.execution_date).filter(
    DR.dag_id.in_([dag_id]), DR.state.in_([State.RUNNING])
):
    result += "\n\t" + str(data_tuple)

print(result, "\n")

*Pausing the DAG merely prevents your Airflow instance from working through the DAG runs one by one; you would not need to pause if your DAG starts with a sensor that waits for the previous run's success, as mine does.
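
If you prefer scripting the clearing step over clicking through the UI, something along these lines should work; it is only a sketch and assumes the DAGs folder is readable from where you run it and that a run for the given execution date exists:

from airflow.models import DagBag
from airflow.utils import timezone

# Load the DAG definition from the configured DAGs folder.
dag = DagBag().get_dag("mydag")

# Execution date of the run you want to clear (example value, matching the log above).
execution_date = timezone.datetime(2020, 12, 26, 5, 0)

# Clear all task instances of that run; their states are reset and the DagRun
# is put back into the running state.
dag.clear(start_date=execution_date, end_date=execution_date)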

I will be creating a PR with the suggested fix shortly.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 6
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
ephraimbuddy commented, Aug 30, 2021

0 reactions
soltanianalytics commented, Aug 30, 2021

Note that in the issue I described above, all DagRuns are already running, but tasks are not scheduled.


