
Incremental Load Predicates To Bound unique_id scans


Describe the feature

Hello! I work with a number of very large tables, 8-60 TB each and growing daily. Data is loaded incrementally.


I often use the delete+insert incremental strategy to ensure that the target table is duplicate-free. Scan times on these large tables often run to multiple hours.

Below is an incremental query using delete+insert, executed against one of these large tables in Snowflake:

(screenshot: full_scan_query)
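
For reference, a rough sketch of the kind of statement dbt’s delete+insert strategy generates (schema, table, and column names here are illustrative, not taken from the issue):

-- dbt first materializes the new batch into a temp relation,
-- then deletes any rows in the target with matching keys:
delete from analytics.events
where unique_id in (
    select unique_id from analytics.events__dbt_tmp
);
-- with no predicate on the target, Snowflake must scan every
-- micro-partition of analytics.events to find matching keys
insert into analytics.events
select * from analytics.events__dbt_tmp;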

Below is the detailed query profile:

(screenshot: scan_partitions)

It takes a lot of resources to perform full table scans on multi-terabyte tables.

Is it possible to add support for predicates in the incremental load SQL?

I created a POC pull request to illustrate this in action. The incremental_predicates are defined as part of the model config:

{{
    config(
      materialized='incremental',
      incremental_strategy='delete+insert',
      incremental_predicates=[
        "collector_hour >= dateadd('day', -7, CONVERT_TIMEZONE('UTC', current_timestamp()))"
      ],
      unique_key='unique_id'
    )
}}

The image below shows the predicates applied to the incremental load:

(screenshot: predicate_query)
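
The exact SQL depends on the POC macro, but the idea is roughly the following (same illustrative names as the sketch above): the configured predicate is appended to the delete, so the engine can prune micro-partitions outside the seven-day window instead of scanning the whole table.

delete from analytics.events
where unique_id in (
    select unique_id from analytics.events__dbt_tmp
)
and collector_hour >= dateadd('day', -7, convert_timezone('UTC', current_timestamp()));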

The effect of bounding the incremental unique-key window is profound:

(screenshot: predicate_profile)

Of course, not every workload supports a bounded unique window, but we found it applicable for our use case.

Describe alternatives you’ve considered

I can think of a couple of alternatives (none of them dbt-based):

  • Scale out database compute to handle full table scans in a reasonable amount of time
  • Upstream dedupers to guarantee uniqueness before the data is touched by dbt

Additional context

I believe all databases could benefit from optional support for incremental predicates.

Who will this benefit?

This should benefit any dbt users who have:

  • Multi-TB database deployments
  • The delete+insert incremental strategy with a unique key
  • Queue-based data sources with limited retention (e.g. 7 days) - since the data sources for these tables are queue-based, we’re guaranteed not to see duplicates outside a fixed window.

Are you interested in contributing this feature?

Yes! I would be happy to!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 11
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
dm03514 commented, Jun 2, 2021

We’ve completely settled on the merge approach with predicates. We created a light macro to expose the predicates through the config for our models. The merge with predicates gracefully handles backfills (something the bounded delete does not do 😕). I will create a PR exposing the merge predicates via config for your feedback. Thank you! 🚀
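
The commenter’s macro isn’t shown, but a sketch of what exposing merge predicates through the model config might look like (the incremental_predicates parameter and the DBT_INTERNAL_DEST alias are assumptions based on how dbt aliases the merge target, not details confirmed by this thread):

{{
    config(
      materialized='incremental',
      incremental_strategy='merge',
      unique_key='unique_id',
      incremental_predicates=[
        "DBT_INTERNAL_DEST.collector_hour >= dateadd('day', -7, current_timestamp())"
      ]
    )
}}

Such a predicate would be ANDed into the generated merge’s on clause; qualifying it with the destination alias is what lets the database prune partitions on the target side rather than on the incoming batch.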

1 reaction
oolongtea commented, May 4, 2022

Not urgent, but I’m definitely still interested in this. One pattern I’d like to use with a custom incremental_strategy is applying a lookback to the target table. Example Snowflake pseudo-SQL:

merge into target
using (
    select id, a, b, c
    from events
) batch
on (target.id = batch.id
    and target.date > dateadd('day', -35, current_date))
when matched then update set
    a = batch.a, b = batch.b, c = batch.c
when not matched then insert (id, a, b, c)
    values (batch.id, batch.a, batch.b, batch.c);

The use case is that the target table keeps growing while each batch only contains data from the last ~30 days, so it’s safe to restrict the target scan to the same date range and avoid scanning the whole table.

We already do this with Airflow, but it’s a pain to have to choose between Airflow-only and dbt when the SQL is simple enough for dbt.

Looks like this is close with dave-connors-3’s PR https://github.com/dbt-labs/dbt-core/pull/4546, so I’ll keep an eye on that. Thanks!
