Data is duplicated on reloading seeds that are using an external table

Describe the bug

When seeds are configured as an external table, the data is getting duplicated on reload.

Steps To Reproduce

I created a repo with a demo of the issue: https://github.com/dejan/dbt-demo-inconsistent-seeds but here are short instructions on how to reproduce it:

Create seeds/cities.csv:

id,name
1,berlin
2,paris

Create seeds/countries.csv:

id,name
1,germany
2,france

Configure one seed as a managed table (the default) and the other as an external table (by setting location_root):

seeds:
  foo:
    cities:
    countries:
      location_root: "{{ env_var('LOCATION') }}"

Run dbt seed twice:

dbt seed && dbt seed

Observe the content in both tables.

select * from foo.cities

id  name
1   berlin
2   paris

select * from foo.countries

id  name
1   germany
1   germany
2   france
2   france
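
To check for this kind of duplication without eyeballing the output, a GROUP BY / HAVING query is enough. Below is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for the Spark/Databricks tables. The helper name `find_duplicates` and the SQLite setup are assumptions for illustration only; the rows are the ones shown above.

```python
import sqlite3


def find_duplicates(conn, table):
    """Return (id, name, count) for every row that occurs more than once."""
    # The table name is interpolated directly; fine for a demo, not for untrusted input.
    query = (
        f"SELECT id, name, COUNT(*) FROM {table} "
        "GROUP BY id, name HAVING COUNT(*) > 1 ORDER BY id"
    )
    return conn.execute(query).fetchall()


# SQLite stands in for the real warehouse (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE countries (id INTEGER, name TEXT)")

# Table contents as observed after running `dbt seed` twice.
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [(1, "berlin"), (2, "paris")],
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [(1, "germany"), (1, "germany"), (2, "france"), (2, "france")],
)

cities_dupes = find_duplicates(conn, "cities")
countries_dupes = find_duplicates(conn, "countries")

print("cities:", cities_dupes)        # no duplicated rows
print("countries:", countries_dupes)  # each row appears twice
```

The same SELECT can be run directly against foo.cities and foo.countries in the warehouse.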

Expected behavior

The behavior should be consistent regardless of the table type. The data should be reloaded, i.e. there should be no duplicates.
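
The difference between the two outcomes can be sketched as "append" versus "replace" loading. The Python snippet below is only an illustration under stated assumptions: it uses SQLite rather than Spark, and the functions `load_append`, `load_replace` and `run_twice` are hypothetical names, not part of dbt or dbt-spark. `load_append` mimics the observed symptom, while `load_replace` shows the expected result, with the delete and the insert wrapped in one transaction so the reload is atomic.

```python
import sqlite3

SEED_ROWS = [(1, "germany"), (2, "france")]


def load_append(conn, rows):
    """Insert the seed rows without clearing the table first (the observed symptom)."""
    conn.executemany("INSERT INTO countries VALUES (?, ?)", rows)
    conn.commit()


def load_replace(conn, rows):
    """Clear the table and insert the seed rows in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM countries")
        conn.executemany("INSERT INTO countries VALUES (?, ?)", rows)


def run_twice(loader):
    """Load the seed twice with the given loader and return the table content."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (id INTEGER, name TEXT)")
    loader(conn, SEED_ROWS)
    loader(conn, SEED_ROWS)
    return conn.execute("SELECT id, name FROM countries ORDER BY id").fetchall()


appended = run_twice(load_append)
replaced = run_twice(load_replace)

print("append: ", appended)   # every row twice
print("replace:", replaced)   # every row once
```

Whether an external table on Spark can offer the same atomic replace depends on the table format, which this sketch does not model.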

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.1.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark:      1.1.0 - Up to date!
  - databricks: 1.1.0 - Up to date!

The operating system you’re using: OS X Big Sur

The output of python --version: Python 3.8.10

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

dejan commented, Jun 20, 2022 (1 reaction)

@allisonwang-db we already have a workaround: we’re deleting the content with a pre-hook, but obviously that’s not ideal, as it’s not atomic and it pollutes the version history.

External tables better fit our company’s data management strategy.

jtcohen6 commented, Jun 17, 2022 (1 reaction)

@dejan Ah, I think this is an issue we were aware of over in dbt-spark: https://github.com/dbt-labs/dbt-spark/issues/112

There was a PR opened for it some months ago, but we didn’t manage to get it over the finish line: https://github.com/dbt-labs/dbt-spark/pull/182
