Data is duplicated when reloading seeds that use an external table
Describe the bug
When a seed is configured as an external table, its data is duplicated on every reload.
Steps To Reproduce
I created a repo with a demo of the issue: https://github.com/dejan/dbt-demo-inconsistent-seeds, but here are short instructions on how to reproduce:
Have seeds/cities.csv:
id,name
1,berlin
2,paris
Have seeds/countries.csv:
id,name
1,germany
2,france
Configure one seed to use a managed table (the default) and the other to use an external table (by setting location_root):
seeds:
  foo:
    cities:
    countries:
      location_root: "{{ env_var('LOCATION') }}"
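The config reads the external path from a LOCATION environment variable, so it has to be set before seeding. The bucket below is a hypothetical placeholder; any writable path works:

export LOCATION='s3://my-bucket/dbt-seeds'   # hypothetical path, not from the original repro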
Run dbt seed twice:
dbt seed && dbt seed
Observe the content in both tables.
select * from foo.cities
id | name
---|-------
1  | berlin
2  | paris
select * from foo.countries
id | name
---|--------
1  | germany
1  | germany
2  | france
2  | france
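A quick check that quantifies the duplication (my addition, not part of the original repro):

select id, name, count(*) as copies
from foo.countries
group by id, name
having count(*) > 1;
-- expected after the second run: (1, germany, 2) and (2, france, 2)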
Expected behavior
The behavior should be consistent regardless of the table type: the data should be fully reloaded, i.e. there should be no duplicates after running dbt seed again.
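A plausible explanation for the duplication (my assumption based on Spark's external-table semantics; the issue itself does not confirm the mechanism): dropping an external table removes only the metastore entry, while the data files under location_root survive, so when dbt recreates the table at the same location and inserts the seed rows, the old files are picked up alongside the new ones. Roughly, in Spark SQL:

-- hedged sketch; the table format and paths are assumptions, not taken from the repro
DROP TABLE foo.countries;
-- external table: only metadata is removed, the files at location_root remain
CREATE TABLE foo.countries (id INT, name STRING)
  USING delta                                      -- format assumed
  LOCATION 's3://my-bucket/dbt-seeds/countries';   -- hypothetical location_root path
-- the recreated table sees the surviving files, then the seed rows are inserted again
INSERT INTO foo.countries VALUES (1, 'germany'), (2, 'france');
-- result: each row exists twice, matching the output observed above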
System information
The output of dbt --version:
Core:
- installed: 1.1.0
- latest: 1.1.1 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- spark: 1.1.0 - Up to date!
- databricks: 1.1.0 - Up to date!
The operating system you’re using: macOS Big Sur
The output of python --version: Python 3.8.10
Top GitHub Comments
@allisonwang-db we already have the workaround: we’re deleting the content with a pre-hook, but obviously that’s not ideal as it’s not atomic and it pollutes the version history.
External tables better fit our company’s data management strategy.
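A minimal sketch of that pre-hook workaround, assuming a Delta-backed seed (the exact statement is my assumption; DELETE FROM versus TRUNCATE TABLE depends on the table format, and this is not the commenter's verbatim config):

seeds:
  foo:
    countries:
      location_root: "{{ env_var('LOCATION') }}"
      # clear existing rows before dbt re-inserts the seed data;
      # not atomic, and each run adds an entry to the Delta table history
      pre-hook: "DELETE FROM {{ this }}"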
@dejan Ah, I think this is an issue we were aware of over in dbt-spark: https://github.com/dbt-labs/dbt-spark/issues/112
There was a PR opened for it some months ago, but we didn’t manage to get it over the finish line: https://github.com/dbt-labs/dbt-spark/pull/182