Rerunning dbt seed appends data instead of refreshing it if the seed is stored in an external table
Environment:
- Spark Standalone cluster 2.4.5 with Thrift JDBC/ODBC server
- dbt 0.18.0
- dbt-spark 0.18.0
I use the example project from https://github.com/fishtown-analytics/jaffle_shop and set location_root of the seed so the data is stored in an external table:
seeds:
  jaffle_shop:
    raw_orders:
      location_root: hdfs:///user/qsbao/
When I repeatedly run dbt seed -s raw_orders, the number of records in the table raw_orders continues to grow.
This is the SQL found in the log for one run:
drop table if exists dbt_alice.raw_orders
create table dbt_alice.raw_orders (id bigint,user_id bigint,order_date date,status string)
location 'hdfs:///user/qsbao/raw_orders'
insert into dbt_alice.raw_orders values
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)...
Note that the first statement, drop table, removes only the metadata and not the data itself, since raw_orders is an external table.
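The growth can be reproduced in miniature. Below is a small Python sketch (purely illustrative, not dbt's actual code) that models the external-table semantics: the metastore entry and the data files at the location are independent, so drop-then-recreate-then-insert keeps appending.

```python
# Illustrative model of Spark external-table semantics (not dbt's code).

# "HDFS": data files keyed by location, independent of the metastore.
hdfs = {}
# Metastore: table name -> location.
metastore = {}

def run_dbt_seed(rows):
    # drop table if exists: for an external table this removes only the
    # metastore entry; the files at the location are left untouched.
    metastore.pop("dbt_alice.raw_orders", None)
    # create table ... location 'hdfs:///user/qsbao/raw_orders'
    metastore["dbt_alice.raw_orders"] = "hdfs:///user/qsbao/raw_orders"
    # insert into: appends to whatever files already exist at the location.
    hdfs.setdefault("hdfs:///user/qsbao/raw_orders", []).extend(rows)

seed_rows = [(1, 1, "2018-01-01", "returned")]
run_dbt_seed(seed_rows)
run_dbt_seed(seed_rows)  # rerun: row count doubles instead of resetting
print(len(hdfs["hdfs:///user/qsbao/raw_orders"]))  # → 2
```

This is exactly the pattern visible in the logged SQL above: each rerun drops only metadata, recreates the table over the same location, and appends another copy of the seed rows.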
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 20 (6 by maintainers)

@mv1742 That’s a really good point! It would be a pretty simple change to accomplish our desired truncate-then-insert behavior, assuming it works with external tables.
The only thing we’ll need to be careful of is that dbt splits very large seeds into 10k-row chunks. We’d only want the first chunk to execute an insert overwrite; the rest should all be insert into. So perhaps in the seed materialization (https://github.com/fishtown-analytics/dbt-spark/blob/dff1b613ddf87e4e72e8a47475bcfd1d55796a5c/dbt/include/spark/macros/materializations/seed.sql#L6-L14), the last line there could become:
Once the solution is confirmed, I am happy to work on this.
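The chunk handling described in the comment can be sketched as follows. This is a hypothetical Python illustration of the idea (dbt's real seed materialization is a Jinja macro; the function name and statement shapes here are assumptions): the first chunk emits insert overwrite so a rerun replaces existing data even in an external table, and every later chunk appends with insert into.

```python
def seed_statements(total_rows, chunk_size=10_000):
    """Sketch of first-chunk-overwrite chunking (not dbt-spark's macro).

    The first chunk uses INSERT OVERWRITE TABLE so rerunning the seed
    replaces the data at the external location; subsequent chunks use
    INSERT INTO so the rest of the same seed load appends correctly.
    """
    statements = []
    for start in range(0, total_rows, chunk_size):
        n = min(chunk_size, total_rows - start)
        verb = "insert overwrite table" if start == 0 else "insert into"
        placeholders = ",".join(["(%s,%s,%s,%s)"] * n)
        statements.append(f"{verb} dbt_alice.raw_orders values {placeholders}")
    return statements

# A 25,000-row seed splits into three chunks: overwrite, append, append.
stmts = seed_statements(25_000)
```

The key invariant is that exactly one statement per seed run overwrites; if every chunk overwrote, only the last 10k rows would survive, and if none did, we would be back to the append-forever bug above.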