dbt runs slowly on Redshift RA3 nodes due to excessive commits
Describe the bug
This is similar to #2748, but is Redshift-specific. However, I think users of other databases would benefit from slightly faster runs.
The next-gen Redshift RA3 nodes seem to have very slow COMMITs, due to the fact that:
- Redshift only processes a single transaction/COMMIT at a time, and
- RA3 nodes write their commits to S3; this makes each COMMIT take about a second (🤮)
Many dbt users have reported excessively slow runs creating views on RA3 nodes. In my experience, creating views while running dbt in multiple threads can take upwards of 20 seconds of clock time (see logs below).
There are at least 4 transactions as part of the view materialization:
- Drop the intermediate relation
- Drop the temp relation
- Create new view; swap names with old view
- Drop old view
Best case, on RA3, this sequence will take 4 seconds per view; in multiple threads, as these sequences stack up in the transaction queue, it will take much longer (in clock time).
Steps To Reproduce
On a Redshift cluster with RA3 nodes (I’m on RA3.xplus x4), execute dbt run in multiple threads in a project with many views.
Expected behavior
Views should ideally take at most a second or two to be materialized.
The four transactions above, used to create a view, could be compressed into a single transaction. I don’t think this would have other side effects, since any one of these operations failing would cause the run step to fail anyway.
To achieve this, I think we would have to modify the materialization macro(s) so dropping the other relations happens inside the same commit. It may be sufficient to add an explicit BEGIN in the materialization.
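To make this concrete, here is a rough sketch of the same work issued inside one explicit transaction. The relation names and the __dbt_tmp / __dbt_backup suffixes are illustrative rather than the exact output of the current macro; the point is that Redshift would only have to process a single COMMIT per view:

-- illustrative only; the real statements are generated by the view materialization macro
begin;
-- drop any leftover intermediate/backup relations from a previous run
drop view if exists analytics.my_model__dbt_tmp cascade;
drop view if exists analytics.my_model__dbt_backup cascade;
-- build the new view under a temporary name
create view analytics.my_model__dbt_tmp as
select 1 as placeholder_column;  -- the model's compiled SQL would go here
-- swap names with the old view, then drop the old one
alter table analytics.my_model rename to my_model__dbt_backup;
alter table analytics.my_model__dbt_tmp rename to my_model;
drop view if exists analytics.my_model__dbt_backup cascade;
commit;  -- one COMMIT instead of four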
Potential workarounds
On dbt v0.18+, we can avoid rebuilding views using the new selection syntax. However, this introduces some complexity and the potential for deployment errors when views change.
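For example (flag and selector spelling from memory of the 0.18/0.19 docs, so double-check before relying on it), a job can rebuild only models whose definitions changed since the last production run by comparing against that run's manifest:

dbt run -m state:modified+ --state path/to/previous/target/

where --state points at the artifacts directory (containing manifest.json) from the prior production run.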
As an alternative, we can use more ephemeral models and fewer views, if the views do not need to be exposed in other contexts; this can make debugging more difficult.
Screenshots and log output
Here’s a recent run in dbt cloud:
Investigating the logs, it’s clear the bottleneck is committing the transaction. These are the statements executed for the START view model analytics.logistics_ops_sectors line above:
Source for logs is svl_statementtext:
with
statements as (
    select
        *,
        -- statement duration, in seconds
        (datediff(milliseconds, starttime, endtime)::float / 1000)::numeric(10, 3) as dur_s,
        -- gap since the previous statement finished, in seconds
        (datediff(milliseconds, lag(endtime, 1) over (order by starttime), starttime)::float / 1000)::numeric(10, 3) as wait_s
    from svl_statementtext
    where
        -- matches specific prod run above
        starttime >= '2021-02-17 08:04:17.124262'
        and starttime <= '2021-02-17 08:04:44.216802'
        and pid = 17738
        and sequence = 0
    order by starttime asc
),
totals as (
    -- null placeholders keep the columns positionally aligned with statements for the union
    select
        null::int,
        null::int,
        null::int,
        null::varchar,
        getdate()::timestamp,
        null::timestamp,
        null::int,
        null::varchar,
        ' --------- TOTAL ---------'::varchar,
        sum(dur_s),
        sum(wait_s)
    from statements
),
unioned as (
    select * from statements
    union all
    select * from totals
)
select * from unioned order by starttime
System information
Which database are you using dbt with?
- postgres
- redshift
- bigquery
- snowflake
- other (specify: ____________)
The output of dbt --version:
0.17.1
The operating system you’re using:
Windows, but reproduces on dbt cloud
The output of python --version:
3.7.9
Top GitHub Comments
Yes, it’s still “slow” after this change, but not as bad: it should be up to a 50% reduction in clock time. Best case, creating a view still requires 2 commits, which should take ~2 seconds without any other workloads on the cluster, but more likely 2-4x that if you’re running dbt in multiple threads or have concurrent workloads.
Coming back to this with some fresh eyes. Instead of modifying the cascade or transaction behavior, I think a better solution is for the materialization to drop ... cascade the intermediate and backup relations only if they already exist (no-op otherwise). I think this would be simple and fast using adapter.get_relation on the “intermediate” and “backup” relation names; a rough sketch is below.
This would introduce a race condition if there was another process modifying the same relations at roughly the same time. But that problem exists with the current implementation, since the drop is done before the temp relation is needed (in a different transaction). And I think if you have multiple processes running dbt on the same models at the same time, all bets are off, anyway.
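A rough sketch of that check-before-drop, assuming the standard adapter.get_relation and adapter.drop_relation context methods, a target_relation variable from the surrounding materialization, and a simplified __dbt_backup naming convention (the real macro derives the intermediate and backup names itself); the same pattern would apply to the intermediate relation:

{# sketch: only issue the drop if the backup relation actually exists #}
{%- set backup_relation = adapter.get_relation(
        database=target_relation.database,
        schema=target_relation.schema,
        identifier=target_relation.identifier ~ '__dbt_backup') -%}
{%- if backup_relation is not none -%}
    {% do adapter.drop_relation(backup_relation) %}
{%- endif -%}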
We may as well make the same change in the table materialization, since the mechanics are the same.
Redshift and Postgres use the materializations from the global_project, so I propose we change it there; Snowflake and BQ have their own materializations.
I can open a PR with this change if we’re good on this solution.