
Triggered tasks from web-ui are killed by scheduler because they are orphaned task instances


Apache Airflow version

2.2.2

What happened

I have a DAG that is scheduled every day and has depends_on_past set to true. A task fails for a given date, which is expected because the required data is not there yet. If I then want to manually run the task for the following day, the run always fails.

The reason for this is that Airflow creates the task instances for the following days and sets their state to None, since they cannot be scheduled while the previous task instance is in the failed state; this is correct and expected behavior. However, if I now manually trigger the task run from the web UI, the task_instance row in the database is not updated, and as a consequence queued_by_job_id is not filled in.

Every 5 minutes the Airflow scheduler queries for orphaned tasks, and since my manual run does not have queued_by_job_id filled in, it always gets killed because the scheduler thinks it is orphaned. The scheduler shows the following logs:

  airflow-scheduler [2022-01-20 11:15:41,784] {scheduler_job.py:1178} INFO - Reset the following 1 orphaned TaskInstances:
  airflow-scheduler <TaskInstance: testmanual.sample scheduled__2022-01-16T00:00:00+00:00 [running]>
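Paraphrased in plain Python, the condition that catches this manual run looks roughly like the following. This is a sketch of the predicate described above, not the scheduler's actual code; the function name and string states are illustrative:

```python
from typing import Optional

# Task-instance states the scheduler considers eligible for adoption/reset.
RESETTABLE_STATES = {"queued", "running"}

def looks_orphaned(ti_state: str,
                   queued_by_job_id: Optional[int],
                   scheduler_job_state: Optional[str]) -> bool:
    """Approximation of the orphan test: a task instance in a resettable
    state whose queuing scheduler job is unknown or no longer running."""
    if ti_state not in RESETTABLE_STATES:
        return False
    return queued_by_job_id is None or scheduler_job_state != "running"

# A manually triggered run whose row was never updated has no
# queued_by_job_id, so it is flagged as orphaned even though it is healthy.
print(looks_orphaned("running", None, None))  # → True
```

Under this reading, any healthy task whose queued_by_job_id was never written will match the orphan query on every periodic check, which is consistent with the run being killed every 5 minutes.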

What you expected to happen

I expect that the manual run will not be killed by the scheduler as supposedly orphaned, and thus that my tasks can still succeed.

If this is expected behavior, it would be best to show an error to the user when they try to trigger the run, stating something like: "Running this task manually is not supported because…". Then at least the limitation is clear to the user; right now the actual reason is hidden and, I assume, not obvious to most users.

How to reproduce

  • Create a DAG that fails for the start_date and, for the following dates, just sleeps for 5 minutes; set the start date to yesterday.
  • Enable the DAG; the run for yesterday will fail (expected).
  • Manually trigger the DAG for today through the web UI.
  • The run for today will be killed by the scheduler every time.
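The steps above can be sketched as a minimal task callable. The DAG and task names (testmanual, sample) are taken from the scheduler log in the report; the callable itself and the wiring shown in comments are illustrative, not the reporter's actual DAG:

```python
import time
from datetime import datetime

START_DATE = datetime(2022, 1, 16)  # "yesterday" in the reproduction

def sample_task(logical_date: datetime, sleep_seconds: int = 300) -> str:
    """Fail on the first scheduled date (simulating missing input data);
    on later dates just sleep for a few minutes and succeed."""
    if logical_date.date() == START_DATE.date():
        raise ValueError("required data is not there yet")
    time.sleep(0)  # time.sleep(sleep_seconds) in the real task
    return "done"

# In a real DAG this callable would be wired up roughly as:
#   with DAG("testmanual", start_date=START_DATE, schedule_interval="@daily",
#            default_args={"depends_on_past": True}) as dag:
#       PythonOperator(task_id="sample", python_callable=sample_task)
```

With depends_on_past=True, the failed first run blocks the later task instances (state None); manually triggering today's run from the web UI then reproduces the orphan-kill behavior described above.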

Operating System

debian buster

Versions of Apache Airflow Providers

/

Deployment

Other Docker-based deployment

Deployment details

We run Airflow on Kubernetes and thus use the KubernetesExecutor to schedule tasks.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
ashb commented, Jan 24, 2022

I do remember we already made some changes in the orphan code, so there is a chance this has already been fixed.

0 reactions
hterik commented, Nov 5, 2022

@potiuk I will take a look at it. I am not that familiar with the inner workings of airflow, but I think that the query for orphaned tasks is wrong. At the moment it does:

  session.query(TI)
      .filter(TI.state.in_(resettable_states))
      .outerjoin(TI.queued_by_job)
      .filter(or_(TI.queued_by_job_id.is_(None), SchedulerJob.state != State.RUNNING))
      .join(TI.dag_run)
      .filter(
          DagRun.run_type != DagRunType.BACKFILL_JOB,
          DagRun.state == State.RUNNING,
      )

and I think that in my case the DagRun.run_type == DagRunType.Manual. I will validate this and supply a fix for it.

I’ve been debugging this and #25021 for a while now, and my prime suspect is also this query, specifically the SchedulerJob.state != State.RUNNING part. I think that condition should only apply on the periodic call of adopt_or_reset_orphaned_tasks(); on the first call at scheduler startup, it should pick up everything, including things that were previously in the running state. It might not be exactly the same cause as this issue, but it could be good to be aware of. Edit: wrong conclusion.


