Incremental load predicates to bound unique_id scans
Describe the feature
Hello! I work with a number of very large tables (8-60 TB and growing daily). Data is loaded incrementally.

I often use the delete+insert incremental load strategy to ensure that the target table is duplicate-free. Scan times on these large tables are often multiple hours.
Below shows an incremental query using delete+insert executed against one of these large tables in Snowflake:
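In rough form, that query looks something like the following sketch (my_model and raw_events are illustrative names, not the real model); the delete has no bound, so it must compare unique_id values against the whole target:

    -- sketch of a delete+insert incremental run (names illustrative)
    create temporary table my_model__dbt_tmp as (
        select * from raw_events          -- new batch from the model's SQL
    );

    -- without predicates, this delete must scan the entire multi-TB target
    delete from my_model
    where unique_id in (select unique_id from my_model__dbt_tmp);

    insert into my_model
    select * from my_model__dbt_tmp;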

Below shows the detailed profile:

It takes a lot of resources to perform full table scans on multi-terabyte tables.
Is it possible to add support for predicates in the incremental load SQL?
I created a POC Pull Request to illustrate this in action. The incremental_predicates are defined as part of the config:
{{
  config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    incremental_predicates=[
      "collector_hour >= dateadd('day', -7, CONVERT_TIMEZONE('UTC', current_timestamp()))"
    ],
    unique_key='unique_id'
  )
}}
The image below shows the predicates applied to the incremental load:
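In SQL terms, the bound looks roughly like this (a sketch, using the same illustrative names as above):

    -- the incremental_predicates are added to the delete, so only the last
    -- 7 days of the target are scanned instead of the full table
    delete from my_model
    where unique_id in (select unique_id from my_model__dbt_tmp)
      and collector_hour >= dateadd('day', -7, CONVERT_TIMEZONE('UTC', current_timestamp()));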

The effect of bounding the incremental unique window is profound:

Of course, not every workload supports a bounded unique window, but we found it applicable for our use case.
Describe alternatives you’ve considered
I can think of a couple of alternatives (none are dbt-based):
- Scale out database compute to handle full table scans in reasonable time periods
- Upstream dedupers to guarantee uniqueness before data is touched by dbt
Additional context
I believe all databases could benefit from optional support of incremental predicates.
Who will this benefit?
This should benefit any dbt users who have:
- Multi-TB database deployments
- The delete+insert incremental strategy with a unique_key
- Queue-based data sources with limited retention (e.g. 7 days) - since the data sources for these tables are queue-based, we’re guaranteed not to see duplicates outside of a fixed window.
Are you interested in contributing this feature?
Yes! I would be happy to!
We’ve completely settled on the merge approach with predicates. We created a light macro to expose the predicates through the config for our models. The merge with predicates gracefully handles backfills (something that the bounded delete does not do 😕). I will create a PR for exposing the merge predicates to config for your feedback. Thank you! 🚀

Not urgent, but I’m definitely still interested in this. One pattern I’d like to use with a custom incremental_strategy is applying a lookback to the target table. Example Snowflake pseudo SQL:
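Something along these lines (placeholder table and column names; the real model has its own schema):

    -- merge that only scans a ~30-day window of the (ever-growing) target
    merge into my_model as target
    using my_model__dbt_tmp as src
      on target.unique_id = src.unique_id
      -- lookback predicate bounds the scan of the target table
      and target.collector_hour >= dateadd('day', -30, CONVERT_TIMEZONE('UTC', current_timestamp()))
    when matched then update set ...
    when not matched then insert ...;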
The use case is that the target table keeps growing while each batch only contains data from the last ~30 days, so it’s safe to restrict the lookup to that date range in the target and avoid scanning the whole table.
We already do this with Airflow, but it’s a pain to have to choose between Airflow-only and dbt when the SQL is simple enough for dbt.
Looks like this is close with dave-connors-3’s PR https://github.com/dbt-labs/dbt-core/pull/4546, so I’ll keep an eye on that. Thanks!