Improve Deduplicate Macro to use QUALIFY

Describe the feature

The deduplicate macro currently combines dbt_utils.star with a subquery. This works around the need to filter on the result of a window function without returning the column used for filtering. The QUALIFY keyword, recently introduced in Snowflake and BigQuery, allows filtering a query directly on a window function, which is cleaner.

This current code:

    select
        {{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
    from (
        select
            _inner.*,
            row_number() over (
                partition by {{ group_by }}
                {% if order_by is not none -%}
                order by {{ order_by }}
                {%- endif %}
            ) as rn
        from {{ relation if relation_alias is none else relation_alias }} as _inner
    ) as deduped
    where deduped.rn = 1

I think it could look like this:

    select
        *
    from {{ relation if relation_alias is none else relation_alias }} deduped
    qualify
        row_number() over (
            partition by {{ group_by }}
            {% if order_by is not none -%}
            order by {{ order_by }}
            {%- endif %}
        ) = 1
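
To make the intended behaviour concrete, here is a minimal sketch of the current subquery form run against a tiny table. It uses Python's built-in sqlite3 module purely for illustration: SQLite has no QUALIFY, so only the subquery version can be run this way, and it needs SQLite 3.25 or later for window functions. The `events` table, its columns, and the sample rows are hypothetical, and the explicit column list stands in for what dbt_utils.star generates in the macro.

```python
import sqlite3

# Hypothetical sample data: two rows for user 1, one row for user 2.
rows = [
    (1, "2022-01-01", "signup"),
    (1, "2022-02-01", "upgrade"),
    (2, "2022-01-15", "signup"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table events (user_id integer, updated_at text, status text)"
)
conn.executemany("insert into events values (?, ?, ?)", rows)

# Subquery form: rn is computed in the inner query and filtered outside.
# The outer select lists the columns explicitly so that rn is not returned;
# in the macro this list comes from dbt_utils.star.
# Here group_by is user_id and order_by is "updated_at desc" (assumed values).
dedupe_sql = """
select
    deduped.user_id,
    deduped.updated_at,
    deduped.status
from (
    select
        _inner.*,
        row_number() over (
            partition by user_id
            order by updated_at desc
        ) as rn
    from events as _inner
) as deduped
where deduped.rn = 1
order by deduped.user_id
"""

cursor = conn.execute(dedupe_sql)
columns = [col[0] for col in cursor.description]
deduped = cursor.fetchall()
conn.close()

print(columns)
print(deduped)
```

The result keeps one row per user_id (the most recent by updated_at) and has no rn column. A QUALIFY version on Snowflake should return the same rows without needing the outer column list.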

Additional context

Although BQ does support QUALIFY, its window functions can choke on a single node when there is too much data, which is why the BQ override for the macro uses array_agg instead. Redshift, and potentially other databases, don’t support QUALIFY at all. But the macro could at least be overridden more cleanly, and probably more performantly, for Snowflake.

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.

Are you interested in contributing this feature?

Possibly. It depends on time and on what the dbt_utils build process looks like these days. The last time I submitted a fix, a few months ago, I definitely needed help from the team to troubleshoot the build process.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
judahrand commented, Apr 14, 2022

@codigo-ergo-sum This isn’t entirely relevant to this issue, but having looked at this for other DBs as well, I’ve come to the conclusion that SQL dialects are even more annoying and limiting than I’d thought. The actual ANSI way of doing this deduplication is so rarely supported that I didn’t even know it existed! As far as I can tell, only Trino and PG13+ actually support it (maybe others do too). The ANSI way of doing this dedupe is:

    select *
    from {{ relation if relation_alias is none else relation_alias }}
    order by row_number() over (
        partition by {{ group_by }}
        order by {{ order_by }}
    ) fetch first row with ties

And even though Snowflake explicitly claims to support the ANSI FETCH syntax, they don’t support WITH TIES!

0 reactions
dbeatty10 commented, May 17, 2022

This can be closed now that https://github.com/dbt-labs/dbt-utils/pull/548 has been merged, I believe.

Thank you for calling this out, @judahrand! Added “Resolves #543” as a comment on #548 for traceability and manually closing this issue.
