RFC: Merging backfills


Description

Make it possible to merge multiple backfills into a single run, by extending the start_date of a single dagrun to cover a time period inclusive of all backfills.

Use case / motivation

There are cases where running multiple backfills is less efficient than having a single run, for example where tasks in successive runs would do duplicate work.

An example:

  • We have a dag which runs every 6 hours, and processes batches of messages from the previous 6 hours by looking at the execution_date and next_execution_date macros.
  • This dag has a task which launches a scan across a very large HBase table looking for matching rows to apply these messages to. The scan takes the same amount of time regardless of the batch size. The scan is the most time-consuming part of the dagrun (let’s say it takes 3 out of 4 hours for an average dagrun).
  • An external error causes 3 successive dagruns to fail.

At this point we have 18 hours of data to catch up on. Assuming the external issue has been fixed, this would take on average 12 hours to process (3 runs of 4 hours each), meaning further delays to processing future jobs. If instead we could merge these runs into a single backfill, the scan would only run once, which would reduce the processing time from 12 hours to something like 6 hours (one 3-hour scan plus roughly 3 hours for the rest of the work). This would greatly reduce the impact of delayed processing and also the resource usage on Airflow and HBase (or, more generally, on whatever external services are involved).
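The arithmetic above can be sketched in plain Python. This is only an illustration, not Airflow code: the function names, the dates, and the assumption that the non-scan work grows linearly with the number of batches are all hypothetical. Only the 3-hour scan and the 4-hour average run come from the example.

```python
from datetime import datetime, timedelta

SCAN_HOURS = 3.0       # fixed cost of the HBase scan per run (from the example)
PER_BATCH_HOURS = 1.0  # assumed remaining work per 6-hour batch (4h run minus 3h scan)


def merge_windows(windows):
    """Merge contiguous or overlapping (start, end) windows into larger ones."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # This window touches the previous one, so extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def estimated_hours(num_runs, batches_per_run):
    """Each run pays the scan once, plus a per-batch cost (linear assumption)."""
    return num_runs * (SCAN_HOURS + batches_per_run * PER_BATCH_HOURS)


interval = timedelta(hours=6)
first = datetime(2020, 11, 26, 0, 0)  # arbitrary illustrative date

# Three failed runs, each covering one 6-hour window.
failed = [(first + i * interval, first + (i + 1) * interval) for i in range(3)]

merged = merge_windows(failed)                  # one 18-hour window
separate_hours = estimated_hours(3, 1)          # three runs, one batch each
merged_hours = estimated_hours(len(merged), 3)  # one run, three batches
```

Under these assumptions the three separate runs cost 12 hours and the merged run costs 6, which matches the rough figures above. If the non-scan work does not scale linearly, the saving will differ.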

This issue of inefficient processing is one that I (and I’m sure others) need to solve. There are obviously workarounds, but I don’t think any of them are correct in the sense of Airflow good practices. For example:

  • Temporarily alter the schedule interval to cover the desired range.
  • Introduce an override in the Airflow variables to make the next run process X batches.
  • Temporarily alter the dag code.
  • Run the dag tasks manually and externally to Airflow, with the desired parameters.

All of these have their own pitfalls and invariably involve some other manual intervention in Airflow to ensure the database is kept accurate and/or future runs aren’t affected.
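To make the second workaround concrete, a batch-count override essentially widens the window that a run reads. A rough sketch in plain Python follows; `processing_window` and its `batches` parameter are hypothetical names, and in a real dag the override would have to be read from an Airflow variable or similar. It has the pitfalls described above, since the runs that were skipped still need to be dealt with by hand.

```python
from datetime import datetime, timedelta


def processing_window(execution_date, next_execution_date, batches=1):
    """Return the (start, end) range a run should process.

    With batches=1 this is the normal window. With batches=N the start is
    pushed back by N-1 schedule intervals so earlier batches are included.
    """
    if batches < 1:
        raise ValueError("batches must be at least 1")
    interval = next_execution_date - execution_date
    start = execution_date - (batches - 1) * interval
    return start, next_execution_date


# Illustrative values: the third of three failed 6-hourly runs.
execution_date = datetime(2020, 11, 26, 12, 0)
next_execution_date = execution_date + timedelta(hours=6)

normal = processing_window(execution_date, next_execution_date)
catch_up = processing_window(execution_date, next_execution_date, batches=3)
```

Here `catch_up` covers the full 18 hours ending at `next_execution_date`, while `normal` covers only the last 6.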

If there is some other solution to this problem that I am unaware of, please let me know. I have raised this as an RFC as any change that implements this feature would touch many areas of the code base, so would require some planning.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
ashb commented, Nov 27, 2020

Yes, this would be a good feature!

What most people seem to do right now as a workaround is to have a special “backfill dag” that does the batching.

We (collectively) will need to spend some time designing an interface for this, and then likely raise it as an Airflow Improvement Proposal: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals

I’ll happily help you with this process.

0 reactions
potiuk commented, Jun 4, 2022

Very old. If this is still an issue, let’s move it into a discussion.
