Using a timedelta object as a Schedule Interval with catchup=False causes the start_date to no longer be honored
Apache Airflow version: 1.8 - 2.0.1 (tested against 1.10.4, 1.10.15, 2.0.1)
Kubernetes version (if you are using kubernetes) (use kubectl version): N/A
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others: Python 2.7.16, 3.7.6 (I don’t think this is a factor)
What happened:
There is an issue with the scheduling of DAGs that use a timedelta object as the DAG schedule_interval argument while also having catchup set to False. If a DAG meets those criteria, then when it’s turned on it ignores the time component of the start_date and runs immediately.
This was previously reported in [AIRFLOW-1156] and was closed with https://github.com/apache/airflow/pull/8776 which fixed the two dag runs problem that was also mentioned in that issue.
What you expected to happen:
I expect it to behave the same as a DAG using a cron expression for the schedule_interval under otherwise identical conditions (i.e. catchup still set to False).
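To make the expectation concrete, here is a minimal sketch in plain Python (no Airflow involved) of a schedule that stays anchored to the start_date. The helper name latest_aligned_slot and the value of now are hypothetical and chosen only for illustration; this is not how Airflow computes its runs.

```python
import datetime as dt


def latest_aligned_slot(start_date, interval, now):
    """Return the latest start_date + k * interval (k >= 0) that is <= now.

    Returns None when now is earlier than start_date.
    Hypothetical helper for illustration; not part of Airflow.
    """
    if now < start_date:
        return None
    k = (now - start_date) // interval  # timedelta // timedelta gives an int
    return start_date + k * interval


start = dt.datetime(2021, 1, 1, 11, 10)
now = dt.datetime(2021, 3, 1, 14, 37, 5)  # assumed moment the DAG is turned on

slot = latest_aligned_slot(start, dt.timedelta(days=1), now)
print(slot)  # the time component stays at 11:10:00
```

With an anchored calculation like this, the time part of every schedule point comes from the start_date, which is what the cron version already does.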
I believe this is a result of how Dag#following_schedule and Dag#previous_schedule are implemented. I traced the SchedulerJob#create_dag_run method and I believe this is due to the Dag methods used in there.
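The suspicion above can be illustrated with a simplified model. The two functions below borrow the method names mentioned in this report, but their bodies are an assumption about how a timedelta schedule might be stepped (plain addition and subtraction); they are not copied from Airflow's source. The value of now is also hypothetical.

```python
import datetime as dt


def following_schedule(dttm, interval):
    # Assumed model: a timedelta schedule has no anchor, so it just steps forward.
    return dttm + interval


def previous_schedule(dttm, interval):
    # Assumed model: stepping backward is plain subtraction.
    return dttm - interval


interval = dt.timedelta(days=1)
start_date = dt.datetime(2021, 1, 1, 11, 10)
now = dt.datetime(2021, 3, 1, 14, 37, 5)  # assumed moment the scheduler looks at the DAG

prev_point = previous_schedule(now, interval)
next_point = following_schedule(now, interval)

# Both points inherit the time of day from `now`, not from start_date.
print(prev_point, next_point)
```

If schedule points are derived from the current time this way, nothing ties them back to the 11:10 in the start_date, which would match the behaviour described in this report.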
How to reproduce it:
Create two DAGs with catchup set to False that are exactly the same except that one uses a timedelta object as the schedule_interval argument and the other uses a cron expression. Set a start_date of sometime in the past. Turn them both on and you should see that the one with a timedelta as the schedule_interval has disregarded the time part of the start_date and used the current time when it started executing as the time part of the execution_date. The version using the cron expression will have used the time from the cron expression.
Example DAG:
import datetime as dt

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag_params = {
    'dag_id': 'schedule_interval_bug_example_dag',
    'default_args': {
        'owner': 'Administrator',
        'depends_on_past': False,
        'retries': 0,
        'email': ['example@example.com']
    },
    'schedule_interval': dt.timedelta(days=1),
    'start_date': dt.datetime(year=2021, month=1, day=1, hour=11, minute=10),
    'catchup': False
}

with DAG(**dag_params) as dag:
    DummyOperator(task_id='start') >> DummyOperator(task_id='end')
For the cron version just change the schedule_interval to '10 11 * * *'.
Here’s a screenshot of this happening on 2.0.1 (although the bug exists in much older versions as well). The expectation would be that the execution_date displayed for both of the DAGs should have a time of 11:10:00.
Anything else we need to know:
I’ve only tested this on DAGs that have a 1 day schedule interval, but testing with other intervals could reveal if this is a problem at finer grained intervals or if it’s isolated to daily runs. Based on what I saw in Dag#following_schedule and Dag#previous_schedule I suspect this would be a problem with shorter intervals as well.
Tested with the SequentialExecutor and StandardTaskRunner, which I don’t think are a factor, but it’s certainly possible.
Happy to provide other details or help in any way.
Top GitHub Comments
I just encountered this issue at work today. It turns out I cannot write a DAG that runs every 6 hours from 11:30. I tried with cron 0 */6 * * * but of course that runs at 00:00/06:00 etc. (that’s expected behaviour). I switched to the obvious start_date + timedelta combo. But it turns out it just uses the DAG unpause time as the new start_date??? Now I have to go around and do stupid things like writing schedules like 30 11,17,23,5 * * * instead of just writing timedelta(hours=6).
For an application which is basically a glorified scheduler, Airflow doesn’t seem to be doing a good job even at that.
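As a quick check of the workaround above, this plain-Python sketch steps a 6-hour timedelta from an 11:30 start and lists the resulting times of day. The calendar date is arbitrary and assumed for illustration; only the time component matters.

```python
import datetime as dt

start = dt.datetime(2021, 1, 1, 11, 30)  # arbitrary date, 11:30 start
interval = dt.timedelta(hours=6)

# Four consecutive slots cover one full day of a 6-hour schedule.
slots = [start + k * interval for k in range(4)]
times = sorted({s.strftime("%H:%M") for s in slots})
print(times)  # ['05:30', '11:30', '17:30', '23:30']
```

These are the same four times that the cron expression 30 11,17,23,5 * * * spells out by hand, which is why an anchored timedelta(hours=6) would be the more natural way to write it.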
I don’t think it is a bug. When you use schedule_interval=timedelta(minutes=5) it just tells the scheduler to run every 5 minutes. Now when catchup=False, it tells the scheduler to run first as soon as it can and then every 5 minutes from then onwards. This is the main difference between cron and timedelta: while cron does not take the “last time” into account, timedelta depends on the “last time”. For example, cron 0 1 * * * just says perform an action every day at 1 am.