Rerunning dbt seed appends data instead of refreshing it if the seed is stored in an external table
Environment:
- Spark Standalone cluster 2.4.5 with Thrift JDBC/ODBC server
- dbt 0.18.0
- dbt-spark 0.18.0
I use the example project from https://github.com/fishtown-analytics/jaffle_shop and set location_root of the seed so the data is stored in an external table:
seeds:
  jaffle_shop:
    raw_orders:
      location_root: hdfs:///user/qsbao/
When I repeatedly run dbt seed -s raw_orders, the number of records in the table raw_orders continues to grow.
This is the SQL found in the log for one run:
drop table if exists dbt_alice.raw_orders
create table dbt_alice.raw_orders (id bigint,user_id bigint,order_date date,status string)
location 'hdfs:///user/qsbao/raw_orders'
insert into dbt_alice.raw_orders values
(%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)...
Note that the first statement, drop table, removes only the metadata and not the data itself, since raw_orders is an external table.
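The growth can be reproduced in miniature. Below is a small Python sketch (purely illustrative, not dbt's actual code) that models the external-table semantics: the metastore entry and the data files at the location are independent, so drop-then-recreate-then-insert keeps appending.

```python
# Illustrative model of Spark external-table semantics (not dbt's code).

# "HDFS": data files keyed by location, independent of the metastore.
hdfs = {}
# Metastore: table name -> location.
metastore = {}

def run_dbt_seed(rows):
    # drop table if exists: for an external table this removes only the
    # metastore entry; the files at the location are left untouched.
    metastore.pop("dbt_alice.raw_orders", None)
    # create table ... location 'hdfs:///user/qsbao/raw_orders'
    metastore["dbt_alice.raw_orders"] = "hdfs:///user/qsbao/raw_orders"
    # insert into: appends to whatever files already exist at the location.
    hdfs.setdefault("hdfs:///user/qsbao/raw_orders", []).extend(rows)

seed_rows = [(1, 1, "2018-01-01", "returned")]
run_dbt_seed(seed_rows)
run_dbt_seed(seed_rows)  # rerun: row count doubles instead of resetting
print(len(hdfs["hdfs:///user/qsbao/raw_orders"]))  # → 2
```

This is exactly the pattern visible in the logged SQL above: each rerun drops only metadata, recreates the table over the same location, and appends another copy of the seed rows.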
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 20 (6 by maintainers)

@mv1742 That’s a really good point! It would be a pretty simple change to accomplish our desired truncate-then-insert behavior, assuming it works with external tables.
The only thing we’ll need to be careful of is that dbt splits very large seeds into 10k-row chunks. We’d only want the first chunk to execute an insert overwrite; the rest should all be insert into. So perhaps in the seed materialization (https://github.com/fishtown-analytics/dbt-spark/blob/dff1b613ddf87e4e72e8a47475bcfd1d55796a5c/dbt/include/spark/macros/materializations/seed.sql#L6-L14), the last line there could become:
Once the solution is confirmed, I am happy to work on this.
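The chunk handling described in the comment can be sketched as follows. This is a hypothetical Python illustration of the idea (dbt's real seed materialization is a Jinja macro; the function name and statement shapes here are assumptions): the first chunk emits insert overwrite so a rerun replaces existing data even in an external table, and every later chunk appends with insert into.

```python
def seed_statements(total_rows, chunk_size=10_000):
    """Sketch of first-chunk-overwrite chunking (not dbt-spark's macro).

    The first chunk uses INSERT OVERWRITE TABLE so rerunning the seed
    replaces the data at the external location; subsequent chunks use
    INSERT INTO so the rest of the same seed load appends correctly.
    """
    statements = []
    for start in range(0, total_rows, chunk_size):
        n = min(chunk_size, total_rows - start)
        verb = "insert overwrite table" if start == 0 else "insert into"
        placeholders = ",".join(["(%s,%s,%s,%s)"] * n)
        statements.append(f"{verb} dbt_alice.raw_orders values {placeholders}")
    return statements

# A 25,000-row seed splits into three chunks: overwrite, append, append.
stmts = seed_statements(25_000)
```

The key invariant is that exactly one statement per seed run overwrites; if every chunk overwrote, only the last 10k rows would survive, and if none did, we would be back to the append-forever bug above.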