Custom schemas: table already exists
Issues with re-running workflows when using custom schemas.
When I create a model with a custom schema configured:
-- models/clean/clean_accounts.sql
{{ config(alias='accounts', schema='clean', materialized='table') }}
select * from {{ source('incoming', 'accounts') }}
I am able to run the workflow successfully once:
> dbt run
...
Completed successfully
However, if I run the same workflow again I get an error:
> dbt run
...
Runtime Error in model clean_accounts (models/clean/clean_accounts.sql)
Database Error
org.apache.spark.sql.AnalysisException: `dev_clean`.`accounts` already exists.;
Instead, the table should be dropped and recreated. If we repeat the same exercise without the schema='clean' configuration, everything works as expected.
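For context on the dev_clean name in the error: dbt's default schema-naming logic appends a model's custom schema to the target schema, so schema='clean' with a target schema of dev produces dev_clean. A minimal Python sketch of that default behaviour (the real logic lives in dbt's generate_schema_name macro; the function below is illustrative only):

# Illustrative only: mirrors dbt's default generate_schema_name behaviour,
# not the actual dbt source.
def generate_schema_name(custom_schema_name, target_schema):
    if custom_schema_name is None:
        return target_schema
    return f"{target_schema}_{custom_schema_name.strip()}"

print(generate_schema_name("clean", "dev"))  # -> "dev_clean"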

hey @eamontaaffe - thanks for your thoughtful writeup here! I appreciate your patience - it was hard to get back in the swing of the dbt-spark plugin, but I’m excited to get this (and the other open PRs in this repo) merged!
I think the change you’ve proposed here is uncontroversial - let me pick this up with you in the open PR.
In the spirit of figuring out what was actually going wrong with
adapter.get_relation, I discovered the cause: in Spark, unlike in other dbt adapters,databaseandschemaare one and the same. Only theschemaproperty of the materialization is updated, however, when a custom schema is declared in a model config. When dbt checks the cache here for a table matching both thedatabaseandschemaof the model, it supplies the custom schema forschemabut the default (target.database) fordatabase.I think we should fix
get_relation, rather than the workaround in #42. We could redefine allget_relationcalls to look likeOr we could re-implement
cache.get_relationsfor the Spark adapter to only check for a matchingschema. I’m leaning toward the latter, what do you think @drewbanin?
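As a rough illustration of the underlying idea (not the actual dbt-spark fix), one could make the Spark adapter's relation lookup ignore the database argument entirely; the class name, import path, and override point below are assumptions based on dbt's SQLAdapter interface:

# Sketch only: a Spark adapter override that ignores `database` when looking
# up relations, since Spark treats database and schema as one namespace.
from dbt.adapters.sql import SQLAdapter

class SparkAdapter(SQLAdapter):
    def get_relation(self, database, schema, identifier):
        # Pass the schema for both arguments so the lookup key matches the
        # key under which Spark relations are cached.
        return super().get_relation(schema, schema, identifier)

Overriding cache.get_relations instead would have a similar effect; either way, the point is that the lookup key and the cached key need to agree on what "database" means for Spark.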