Rerunning dbt seed appends data instead of refreshing it if the seed is stored in an external table


Environment:

  • Spark Standalone cluster 2.4.5 with Thrift JDBC/ODBC server
  • dbt 0.18.0
  • dbt-spark 0.18.0

I am using the example project from https://github.com/fishtown-analytics/jaffle_shop, with location_root set on the seed so that its data is stored in an external table.

seeds:
  jaffle_shop:
    raw_orders:
      location_root: hdfs:///user/qsbao/

When I run dbt seed -s raw_orders repeatedly, the number of records in the raw_orders table keeps growing.

The log from one run shows the following SQL:

drop table if exists dbt_alice.raw_orders

create table dbt_alice.raw_orders (id bigint,user_id bigint,order_date date,status string)
location 'hdfs:///user/qsbao/raw_orders'

insert into dbt_alice.raw_orders values
            (%s,%s,%s,%s),(%s,%s,%s,%s),(%s,%s,%s,%s)...

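Note that the first statement, drop table, removes only the metadata and not the data itself, because this is an external table. The recreated table points at the same location, so every run's insert adds rows on top of the data left behind by earlier runs.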
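The accumulation can be reproduced with a toy model. The sketch below is not Spark code: two plain dictionaries stand in for the metastore and for HDFS, all function names are hypothetical, and the three seed rows are made up for illustration. It only assumes what the issue describes, namely that dropping an external table leaves the files at its location untouched.

```python
# Toy model only: dicts stand in for the metastore and the file system.
metastore = {}  # table name -> storage location (metadata)
storage = {}    # storage location -> rows (the data files)


def drop_external_table(name):
    # Dropping an external table removes the metadata entry only.
    metastore.pop(name, None)


def create_external_table(name, location):
    metastore[name] = location
    # Files already present at the location are kept as they are.
    storage.setdefault(location, [])


def insert_into(name, rows):
    storage[metastore[name]].extend(rows)


def run_seed(name, location, rows):
    # Mirrors the logged sequence: drop, create, insert.
    drop_external_table(name)
    create_external_table(name, location)
    insert_into(name, rows)
    return len(storage[metastore[name]])


# Made-up rows, not the real jaffle_shop seed.
seed_rows = [
    (1, 1, "2018-01-01", "returned"),
    (2, 3, "2018-01-02", "completed"),
    (3, 94, "2018-01-04", "completed"),
]

counts = [
    run_seed("dbt_alice.raw_orders", "hdfs:///user/qsbao/raw_orders", seed_rows)
    for _ in range(3)
]
print(counts)  # [3, 6, 9]
```

Each simulated run adds the same three rows again, which matches the growth reported above.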

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 20 (6 by maintainers)

Top GitHub Comments

jtcohen6 commented, May 10, 2021 (2 reactions)

@mv1742 That’s a really good point! It would be a pretty simple change to accomplish our desired behavior around truncate-then-insert, assuming it works with external tables.

The one thing we'll need to be careful about is that dbt splits very large seeds into 10k-row chunks. We'd only want the first chunk to execute an insert overwrite; all the remaining chunks should use insert into.

So perhaps in the seed materialization: https://github.com/fishtown-analytics/dbt-spark/blob/dff1b613ddf87e4e72e8a47475bcfd1d55796a5c/dbt/include/spark/macros/materializations/seed.sql#L6-L14

The last line there could become:

insert {{ "overwrite" if loop.first else "into" }} {{ this.render() }} values
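As a rough illustration of that chunking rule outside of Jinja, the Python sketch below builds one statement per chunk and switches the keyword after the first chunk. The helper names chunked and seed_statements are hypothetical and are not part of dbt or dbt-spark; the 10,000-row chunk size comes from the comment above, and the sample rows are invented.

```python
def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def seed_statements(table, rows, chunk_size=10_000):
    """Build one parameterized insert per chunk (hypothetical helper).

    Only the first chunk overwrites the table; later chunks append,
    mirroring the `loop.first` check in the suggested template line.
    """
    statements = []
    for index, chunk in enumerate(chunked(rows, chunk_size)):
        mode = "overwrite" if index == 0 else "into"
        placeholders = ",".join(
            "(" + ",".join(["%s"] * len(row)) + ")" for row in chunk
        )
        statements.append(f"insert {mode} {table} values {placeholders}")
    return statements


# Invented sample data: 25,000 rows, so three chunks.
rows = [(i, i, "2018-01-01", "placed") for i in range(25_000)]
stmts = seed_statements("dbt_alice.raw_orders", rows)
print([s.split(" values")[0] for s in stmts])
# ['insert overwrite dbt_alice.raw_orders',
#  'insert into dbt_alice.raw_orders',
#  'insert into dbt_alice.raw_orders']
```

If every chunk used overwrite instead, each chunk would replace the previous one and only the last chunk's rows would remain, which is why the switch applies to the first chunk alone.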

qsbao commented, Oct 23, 2020 (2 reactions)

And after the solution is confirmed, I am happy to work on this.
