dbt models that are parquet files

I haven’t thought through this deeply. It might not make sense, it might require changes to dbt, or it might already work? But I wanted to raise it just in case, because it would help me out with something I’m building.

Could we enable configuring dbt-duckdb such that

select
    bla
from {{ ref("orders") }}

compiles to

select
    bla
from 's3://bucket/orders.parquet'

Top GitHub Comments

jwills commented, Oct 23, 2022

I wired up a version of this idea here: https://github.com/jwills/dbt-duckdb/compare/jwills_file_based_dbs?expand=1

I got things working, but I didn’t feel great about it. My “idea” was to take advantage of the fact that the database config parameter is a no-op for DuckDB: if you specify a path instead of the default (main), I do some hacks to treat any models under that database + schema path as parquet/CSV files, including when you ref them in other models. So, for example, a ref’d model that uses the parquet materialization will be rendered when it is queried as <database>/<schema>/<model>.parquet instead of database.schema.model.
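To make the rendering rule above concrete, here is a minimal sketch in plain Python. It is not the code from the linked branch: the class name `FileAwareRelation`, its fields, and the `file_format` attribute are hypothetical, and the only behavior taken from the comment is that a non-default database value is treated as a path prefix for parquet/CSV models.

```python
from dataclasses import dataclass
from typing import Optional

# The comment above names "main" as the default database value.
DEFAULT_DATABASE = "main"


@dataclass(frozen=True)
class FileAwareRelation:
    """Hypothetical stand-in for a relation; not the real DuckDBRelation."""

    database: str
    schema: str
    identifier: str
    file_format: Optional[str] = None  # "parquet", "csv", or None for a table/view

    def render(self) -> str:
        # A file-backed model under a non-default "database" is rendered as a
        # quoted file path that DuckDB can read directly.
        if self.file_format and self.database != DEFAULT_DATABASE:
            path = f"{self.database}/{self.schema}/{self.identifier}.{self.file_format}"
            return f"'{path}'"
        # Everything else keeps the usual dotted name.
        return f"{self.database}.{self.schema}.{self.identifier}"


file_rel = FileAwareRelation("s3://bucket", "analytics", "orders", "parquet")
table_rel = FileAwareRelation("main", "analytics", "orders")
print(file_rel.render())
print(table_rel.render())
```

The sketch also shows why this felt hacky: every place that renders a relation has to know whether it is looking at a file or a catalog object.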

I like what @tomsej is saying better tho (i.e., the parquet materialization acts like a view over a parquet file, where the location of where the parquet file(s) should be materialized is specified…somewhere?), b/c it keeps the metadata catalog where it belongs: inside of DuckDB, and not externally managed via dbt-duckdb + the filesystem. Radek’s approach means that we don’t have to jump through a whole bunch of hoops inside of the DuckDBAdapter and DuckDBRelation classes (as I do in the above branch) to render the relation differently when it’s a parquet/csv file instead of a regular table/view. To me, that makes the parquet materialization into some syntactic sugar that is equivalent to materializing a view model that has a post-hook which does the COPY (SELECT * FROM {{ this }}) TO '/path/to/output.parquet' for us, and I would be 👍 for such a materialization (and maybe a csv one that did the same sort of thing if we were so inclined?).
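The "view plus post-hook" equivalence can be sketched as the two statements such a materialization would end up issuing. This is an illustration only: `view_with_copy_hook` is a hypothetical helper, the view name and SELECT are made up, and the `(FORMAT PARQUET)` option is added here for explicitness rather than taken from the comment.

```python
from typing import List


def view_with_copy_hook(view_name: str, select_sql: str, output_path: str) -> List[str]:
    """Return the statements for a view model followed by a COPY post-hook.

    Hypothetical helper: no quoting or escaping is done, so inputs are
    assumed to be trusted identifiers and paths.
    """
    return [
        # 1. The model itself stays a regular view in the DuckDB catalog.
        f"CREATE OR REPLACE VIEW {view_name} AS {select_sql}",
        # 2. The post-hook exports the view's rows to a parquet file.
        f"COPY (SELECT * FROM {view_name}) TO '{output_path}' (FORMAT PARQUET)",
    ]


for statement in view_with_copy_hook(
    "main.orders", "SELECT 1 AS bla", "/path/to/output.parquet"
):
    print(statement + ";")
```

Downstream models keep referencing the view by its normal name, so nothing about relation rendering has to change.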

tomsej commented, Oct 23, 2022

I was thinking about the possible solutions for this too. I think this is only viable for table materializations:

incrementals and snapshots are not possible, since parquet does not support updates (unlike iceberg or delta).
views are database objects, so a parquet file does not make sense for them.
seeds are already outside the database.

I was thinking about introducing a new type of materialization (or adding some extra parameters to the current one), e.g. parquet. With that, the last steps of the table materialization (usually something like CREATE TABLE ...) would instead be COPY (SELECT ...) TO '<location parameter>.parquet' and CREATE VIEW AS SELECT FROM <location parameter>.parquet, so the table would not be an actual table but a view on the parquet file. I think this is similar to what @AlexanderVR is doing. Any thoughts?
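As a rough sketch of those last two steps, the following builds the statements a parquet materialization might run in place of CREATE TABLE. The function name, its parameters, and the example bucket path are assumptions made for illustration; no such materialization exists in the thread yet.

```python
from typing import List


def parquet_materialization_statements(
    relation: str, select_sql: str, location: str
) -> List[str]:
    """Build the final statements of a hypothetical 'parquet' materialization.

    `location` plays the role of the '<location parameter>' in the comment
    above. No escaping is performed, so inputs are assumed to be trusted.
    """
    path = f"{location}.parquet"
    return [
        # Write the model's result set to the parquet file first...
        f"COPY ({select_sql}) TO '{path}' (FORMAT PARQUET)",
        # ...then expose the file under the model's name as a view.
        f"CREATE OR REPLACE VIEW {relation} AS SELECT * FROM '{path}'",
    ]


for statement in parquet_materialization_statements(
    "main.orders", "SELECT 1 AS bla", "s3://bucket/orders"
):
    print(statement + ";")
```

Unlike the post-hook variant, the view here reads from the file, so the parquet file is the single copy of the data and the catalog entry only points at it.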