Data is duplicated when reloading seeds that use an external table
Describe the bug
When a seed is configured as an external table, its data is duplicated on every reload.
Steps To Reproduce
I created a repo with a demo of the issue: https://github.com/dejan/dbt-demo-inconsistent-seeds, but here are short instructions on how to reproduce:
Have seeds/cities.csv:
id,name
1,berlin
2,paris
Have seeds/countries.csv:
id,name
1,germany
2,france
Configure one seed to use a managed table (the default) and the other to use an external table (by setting location_root):
seeds:
  foo:
    cities:
    countries:
      location_root: "{{ env_var('LOCATION') }}"
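The config reads the external path from a LOCATION environment variable, so it has to be set before seeding. The bucket below is a hypothetical placeholder; any writable path works:

export LOCATION='s3://my-bucket/dbt-seeds'   # hypothetical path, not from the original repro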
Run dbt seed twice:
dbt seed && dbt seed
Observe the content in both tables.
select * from foo.cities
id | name
---|-------
1  | berlin
2  | paris
select * from foo.countries
id | name
---|--------
1  | germany
1  | germany
2  | france
2  | france
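A quick check that quantifies the duplication (my addition, not part of the original repro):

select id, name, count(*) as copies
from foo.countries
group by id, name
having count(*) > 1;
-- expected after the second run: (1, germany, 2) and (2, france, 2)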
Expected behavior
The behavior should be consistent regardless of the table type: the data should be fully reloaded, i.e. there should be no duplicates after running dbt seed again.
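A plausible explanation for the duplication (my assumption based on Spark's external-table semantics; the issue itself does not confirm the mechanism): dropping an external table removes only the metastore entry, while the data files under location_root survive, so when dbt recreates the table at the same location and inserts the seed rows, the old files are picked up alongside the new ones. Roughly, in Spark SQL:

-- hedged sketch; the table format and paths are assumptions, not taken from the repro
DROP TABLE foo.countries;
-- external table: only metadata is removed, the files at location_root remain
CREATE TABLE foo.countries (id INT, name STRING)
  USING delta                                      -- format assumed
  LOCATION 's3://my-bucket/dbt-seeds/countries';   -- hypothetical location_root path
-- the recreated table sees the surviving files, then the seed rows are inserted again
INSERT INTO foo.countries VALUES (1, 'germany'), (2, 'france');
-- result: each row exists twice, matching the output observed above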
System information
The output of dbt --version:
Core:
- installed: 1.1.0
- latest: 1.1.1 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- spark: 1.1.0 - Up to date!
- databricks: 1.1.0 - Up to date!
The operating system you’re using: macOS Big Sur
The output of python --version: Python 3.8.10
Top GitHub Comments
@allisonwang-db we already have the workaround: we’re deleting the content with a pre-hook, but obviously that’s not ideal as it’s not atomic and it pollutes the version history.
External tables better fit our company’s data management strategy.
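A minimal sketch of that pre-hook workaround, assuming a Delta-backed seed (the exact statement is my assumption; DELETE FROM versus TRUNCATE TABLE depends on the table format, and this is not the commenter's verbatim config):

seeds:
  foo:
    countries:
      location_root: "{{ env_var('LOCATION') }}"
      # clear existing rows before dbt re-inserts the seed data;
      # not atomic, and each run adds an entry to the Delta table history
      pre-hook: "DELETE FROM {{ this }}"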
@dejan Ah, I think this is an issue we were aware of over in dbt-spark: https://github.com/dbt-labs/dbt-spark/issues/112
There was a PR opened for it some months ago, but we didn’t manage to get it over the finish line: https://github.com/dbt-labs/dbt-spark/pull/182