Using a timedelta object as a Schedule Interval with catchup=False causes the start_date to no longer be honored

Apache Airflow version: 1.8 - 2.0.1 (tested against 1.10.4, 1.10.15, 2.0.1)

Kubernetes version (if you are using kubernetes) (use kubectl version): N/A

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others: Python 2.7.16, 3.7.6 (I don’t think this is a factor)

What happened:

DAGs that use a timedelta object as the schedule_interval argument and also set catchup to False are scheduled incorrectly. When a DAG that meets both criteria is turned on, it ignores the time component of the start_date and runs immediately.

This was previously reported in [AIRFLOW-1156], which was closed by https://github.com/apache/airflow/pull/8776. That PR fixed the two-DAG-runs problem that was also mentioned in that issue.

What you expected to happen:

I expect it to behave the same as a DAG that uses a cron expression for the schedule_interval under otherwise identical conditions (i.e. catchup still set to False).

I believe this is a result of how Dag#following_schedule and Dag#previous_schedule are implemented. I traced the SchedulerJob#create_dag_run method, and the behavior appears to come from the Dag methods it calls.

How to reproduce it:

Create two DAGs with catchup set to False that are identical except that one uses a timedelta object as the schedule_interval argument and the other uses a cron expression. Set a start_date some time in the past. Turn them both on. The DAG with the timedelta schedule_interval disregards the time part of the start_date and instead uses the time at which it started executing as the time part of the execution_date. The DAG with the cron expression uses the time from the cron expression.

Example DAG:

import datetime as dt

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag_params = {
    'dag_id': 'schedule_interval_bug_example_dag',
    'default_args': {
        'owner': 'Administrator',
        'depends_on_past': False,
        'retries': 0,
        'email': ['example@example.com']
    },
    # timedelta schedule: this is the variant that shows the bug
    'schedule_interval': dt.timedelta(days=1),
    # start_date in the past, with an explicit time of 11:10
    'start_date': dt.datetime(year=2021, month=1, day=1, hour=11, minute=10),
    'catchup': False
}

with DAG(**dag_params) as dag:
    DummyOperator(task_id='start') >> DummyOperator(task_id='end')

For the cron version, just change the schedule_interval to the string '10 11 * * *'.

Here’s a screenshot of this happening on 2.0.1 (although the bug exists in much older versions as well). The expectation would be that the execution_date displayed for both of the DAGs should have a time of 11:10:00.

[Screenshot: Screen Shot 2021-03-23 at 5:54:53 PM]

Anything else we need to know:

I’ve only tested this on DAGs that have a 1 day schedule interval. Testing with other intervals could reveal whether this is also a problem at finer-grained intervals or is isolated to daily runs. Based on what I saw in Dag#following_schedule and Dag#previous_schedule, I suspect it would be a problem with shorter intervals as well.

Tested with the SequentialExecutor and StandardTaskRunner, which I don’t think are a factor, but it’s certainly possible.

Happy to provide other details or help in any way.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
KayleMaster commented, Mar 29, 2021

I just encountered this issue at work today. It turns out I cannot write a DAG that runs every 6 hours from 11:30. I tried with cron 0 */6 * * * but of course that runs at 00:00, 06:00, etc. (that’s expected behaviour). I switched to the obvious start_date + timedelta combo. But it turns out it just uses the DAG unpause time as the new start_date??? Now I have to go around and do stupid things like writing schedules like 30 11,17,23,5 * * * instead of just writing timedelta(hours=6).
For an application which is basically a glorified scheduler, Airflow doesn’t seem to be doing a good job even at that.
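
The cron workaround mentioned in this comment can be generated rather than written by hand. The following is a minimal sketch in plain Python, assuming the interval is a whole number of hours that divides 24 evenly. `cron_for_fixed_interval` is a hypothetical helper written for illustration; it is not part of Airflow.

```python
def cron_for_fixed_interval(start_hour, start_minute, interval_hours):
    """Build a daily cron expression that fires every `interval_hours`,
    aligned to the given start time.

    Hypothetical helper for illustration only. It works only when
    `interval_hours` divides 24 evenly; other intervals drift across days
    and cannot be expressed as a single hour list.
    """
    if not (0 <= start_hour < 24 and 0 <= start_minute < 60):
        raise ValueError("start time out of range")
    if interval_hours <= 0 or 24 % interval_hours != 0:
        raise ValueError("interval_hours must be a positive divisor of 24")

    # Walk one full day in steps of interval_hours, wrapping past midnight.
    hours = sorted(
        (start_hour + k * interval_hours) % 24
        for k in range(24 // interval_hours)
    )
    return "{} {} * * *".format(start_minute, ",".join(str(h) for h in hours))


# Every 6 hours from 11:30, as in the comment above.
print(cron_for_fixed_interval(11, 30, 6))  # 30 5,11,17,23 * * *
```

The hour list is sorted, so the result is equivalent to the hand-written 30 11,17,23,5 * * * from the comment. The resulting string would be passed as schedule_interval in place of timedelta(hours=6).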

2 reactions
kaxil commented, Mar 24, 2021

I don’t think it is a bug. When you use schedule_interval=timedelta(minutes=5), it just tells the scheduler to run every 5 minutes. Now when catchup=False, it tells the scheduler to run first as soon as it can and then every 5 minutes from then onwards.

This is the main difference between cron and timedelta. While cron does not take account of the “last time”, timedelta is dependent on the “last time”. For example, the cron expression 0 1 * * * just says to perform an action every day at 1 am.
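
To make the two anchoring behaviours discussed in this thread concrete, here is a minimal sketch in plain Python 3 (no Airflow imports). It is not Airflow's scheduler code and only models the time-of-day alignment. `latest_aligned_point` and the `unpaused_at` value are hypothetical, chosen for illustration, while `start_date` and `interval` mirror the example DAG in the issue.

```python
import datetime as dt


def latest_aligned_point(start_date, interval, now):
    """Return the latest start_date + k * interval (k >= 0) not after `now`.

    Hypothetical helper: it models the alignment the reporter expected,
    not what Airflow actually computes.
    """
    if now < start_date:
        raise ValueError("now is before start_date")
    # timedelta // timedelta yields an int in Python 3.
    elapsed_intervals = (now - start_date) // interval
    return start_date + elapsed_intervals * interval


start_date = dt.datetime(2021, 1, 1, 11, 10)
interval = dt.timedelta(days=1)
unpaused_at = dt.datetime(2021, 3, 23, 17, 54, 53)  # assumed unpause time

# Behaviour described in the issue: runs are anchored to the unpause time,
# so every run inherits its time of day.
anchored_to_unpause = [unpaused_at + i * interval for i in range(3)]

# Behaviour the reporter expected: runs stay aligned with start_date,
# so every run keeps the 11:10 time of day.
aligned = latest_aligned_point(start_date, interval, unpaused_at)
aligned_runs = [aligned + i * interval for i in range(3)]

for run in anchored_to_unpause:
    print("anchored to unpause:", run)
for run in aligned_runs:
    print("aligned to start_date:", run)
```

With a cron expression such as 10 11 * * *, the time of day comes from the expression itself, which is why the cron variant of the DAG lines up with 11:10 regardless of when it is turned on.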
