
Incremental Load Predicates To Bound unique_id scans


Describe the feature

Hello! I work with a number of very large tables, 8-60 TB each and growing daily. Data is loaded incrementally.


I often use the delete+insert incremental strategy to ensure that the target table is duplicate-free. Scan times on these large tables often run to multiple hours.

Below is an incremental query using delete+insert, executed against one of these large tables in Snowflake:

(screenshot: full_scan_query)
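
For reference, a rough sketch of the kind of statement dbt’s delete+insert strategy generates (schema, table, and column names here are illustrative, not taken from the issue):

-- dbt first materializes the new batch into a temp relation,
-- then deletes any rows in the target with matching keys:
delete from analytics.events
where unique_id in (
    select unique_id from analytics.events__dbt_tmp
);
-- with no predicate on the target, Snowflake must scan every
-- micro-partition of analytics.events to find matching keys
insert into analytics.events
select * from analytics.events__dbt_tmp;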

Below is the detailed query profile:

(screenshot: scan_partitions)

It takes a lot of resources to perform full table scans on multi-terabyte tables.

Is it possible to add support for predicates in the incremental load SQL?

I created a POC pull request to illustrate this in action. The incremental_predicates are defined as part of the model config:

{{
    config(
      materialized='incremental',
      incremental_strategy='delete+insert',
      incremental_predicates=[
        "collector_hour >= dateadd('day', -7, CONVERT_TIMEZONE('UTC', current_timestamp()))"
      ],
      unique_key='unique_id'
    )
}}

The image below shows the predicates applied to the incremental load:

(screenshot: predicate_query)
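
The exact SQL depends on the POC macro, but the idea is roughly the following (same illustrative names as the sketch above): the configured predicate is appended to the delete, so the engine can prune micro-partitions outside the seven-day window instead of scanning the whole table.

delete from analytics.events
where unique_id in (
    select unique_id from analytics.events__dbt_tmp
)
and collector_hour >= dateadd('day', -7, convert_timezone('UTC', current_timestamp()));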

The effect of bounding the incremental unique-key window is profound:

(screenshot: predicate_profile)

Of course, not every workload supports a bounded unique window, but we found it applicable for our use case.

Describe alternatives you’ve considered

I can think of a couple of alternatives (none of them dbt-based):

  • Scale out database compute to handle full table scans in a reasonable amount of time
  • Upstream dedupers to guarantee uniqueness before the data is touched by dbt

Additional context

I believe all databases could benefit from optional support for incremental predicates.

Who will this benefit?

This should benefit any dbt users who have:

  • Multi-TB database deployments
  • The delete+insert incremental strategy with a unique key
  • Queue-based data sources with limited retention (e.g. 7 days) - since the data sources for these tables are queue-based, we’re guaranteed not to see duplicates outside a fixed window.

Are you interested in contributing this feature?

Yes! I would be happy to!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 11
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
dm03514 commented, Jun 2, 2021

We’ve completely settled on the merge approach with predicates. We created a light macro to expose the predicates through the config for our models. The merge with predicates gracefully handles backfills (something the bounded delete does not do 😕). I will create a PR exposing the merge predicates via config for your feedback. Thank you! 🚀
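
The commenter’s macro isn’t shown, but a sketch of what exposing merge predicates through the model config might look like (the incremental_predicates parameter and the DBT_INTERNAL_DEST alias are assumptions based on how dbt aliases the merge target, not details confirmed by this thread):

{{
    config(
      materialized='incremental',
      incremental_strategy='merge',
      unique_key='unique_id',
      incremental_predicates=[
        "DBT_INTERNAL_DEST.collector_hour >= dateadd('day', -7, current_timestamp())"
      ]
    )
}}

Such a predicate would be ANDed into the generated merge’s on clause; qualifying it with the destination alias is what lets the database prune partitions on the target side rather than on the incoming batch.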

1 reaction
oolongtea commented, May 4, 2022

Not urgent, but I’m definitely still interested in this. One pattern I’d like to use with a custom incremental_strategy is applying a lookback to the target table. Example Snowflake pseudo-SQL:

merge into target
using (
    select id, a, b, c
    from events
) batch
on (target.id = batch.id
    and target.date > dateadd('day', -35, current_date))
when matched then update set
    a = batch.a, b = batch.b, c = batch.c
when not matched then insert (id, a, b, c)
    values (batch.id, batch.a, batch.b, batch.c);

The use case is that the target table keeps growing while each batch only contains data from the last ~30 days, so it’s safe to restrict the target scan to the same date range and avoid scanning the whole table.

We already do this with Airflow, but it’s a pain to have to choose between Airflow-only and dbt when the SQL is simple enough for dbt.

Looks like this is close with dave-connors-3’s PR https://github.com/dbt-labs/dbt-core/pull/4546, so I’ll keep an eye on that. Thanks!
