
dbt models that are parquet files

See original GitHub issue

I haven’t thought through this deeply. It might not make sense, it might require changes to dbt, or maybe it already works? But I wanted to raise it just in case, because it would help me out with something I’m building.

Could we enable configuring dbt-duckdb such that

select
    bla
from {{ ref("orders") }}

compiles to

select
    bla
from 's3://bucket/orders.parquet'

?
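For what it’s worth, the target SQL is already valid DuckDB: with the httpfs extension loaded (and S3 credentials configured), a quoted Parquet path can be used anywhere a table name can. A minimal sketch, with a placeholder bucket:

install httpfs;
load httpfs;

-- A quoted .parquet path is scanned like a table;
-- this is equivalent to read_parquet('s3://bucket/orders.parquet').
select
    bla
from 's3://bucket/orders.parquet';

So the open question is purely on the dbt side: teaching ref() to render a file path instead of a database.schema.identifier.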

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

1 reaction
jwills commented, Oct 23, 2022

I wired up a version of this idea here: https://github.com/jwills/dbt-duckdb/compare/jwills_file_based_dbs?expand=1

I got things working, but I didn’t feel great about it. My “idea” was to take advantage of the fact that the database config parameter is a no-op for DuckDB: if you specify a path instead of the default (main), some hacks treat any models under that database + schema path as Parquet/CSV files, including when you ref them in other models (so, for example, a ref’d model that uses the parquet materialization is rendered, when queried, as <database>/<schema>/<model>.parquet instead of database.schema.model).
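Concretely, under that branch a ref to such a model renders as a quoted file scan rather than a qualified relation name; roughly (names and paths hypothetical):

-- ref("orders") under a path-based database config renders as:
select
    bla
from 'reporting/main/orders.parquet'  -- instead of: from reporting.main.orders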

I like what @tomsej is saying better, though (i.e., the parquet materialization acts like a view over a parquet file, where the location the parquet file(s) should be materialized to is specified…somewhere?), because it keeps the metadata catalog where it belongs: inside DuckDB, not externally managed via dbt-duckdb plus the filesystem. Radek’s approach means we don’t have to jump through a whole bunch of hoops inside the DuckDBAdapter and DuckDBRelation classes (as I do in the above branch) to render the relation differently when it’s a Parquet/CSV file instead of a regular table/view. To me, that makes the parquet materialization syntactic sugar equivalent to materializing a view model with a post-hook that does the COPY (SELECT * FROM {{ this }}) TO '/path/to/output.parquet' for us, and I would be 👍 for such a materialization (and maybe a csv one that did the same sort of thing, if we were so inclined?)
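Spelled out, the de-sugared equivalent would be an ordinary view model carrying that post-hook, roughly like this (the model name and output path are placeholders):

-- models/orders_export.sql
{{ config(
    materialized='view',
    post_hook="COPY (SELECT * FROM {{ this }}) TO '/path/to/output.parquet' (FORMAT PARQUET)"
) }}

select
    bla
from {{ ref("orders") }}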

1 reaction
tomsej commented, Oct 23, 2022

I was thinking about possible solutions for this too. I think this is only viable for table materializations:

  • incrementals and snapshots are not possible, since Parquet does not support updates (unlike Iceberg or Delta)
  • views are database objects, so they do not make sense
  • seeds already live outside the database

I was thinking about introducing a new type of materialization (or adding some extra parameters to the current one), e.g. parquet. With that, the last step of the table materialization (usually something like CREATE TABLE ...) would instead be COPY (SELECT ...) TO '<location parameter>.parquet' followed by CREATE VIEW ... AS SELECT * FROM '<location parameter>.parquet', so the “table” would not be an actual table but a view over the parquet file. I think this is similar to what @AlexanderVR is doing. Any thoughts?
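In DuckDB SQL terms, the final step of such a parquet materialization would boil down to two statements; a sketch with placeholder names and locations:

-- 1. Write the model's result set out as a Parquet file.
copy (select * from staged_orders) to '/warehouse/orders.parquet' (format parquet);

-- 2. Register a view over the file so downstream refs resolve as usual.
create or replace view main.orders as
select * from '/warehouse/orders.parquet';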

Read more comments on GitHub

Top Results From Across the Web

  • Using external parquet tables in a DBT pipeline - Stack Overflow
    I’m trying to set up a simple DBT pipeline that uses parquet tables stored on Azure Data Lake Storage and creates another...
  • Apache Spark configurations | dbt Developer Hub
    The file format to use when creating tables (parquet, delta, hudi, csv, json, text, jdbc, ...
  • Can DBT write to local parquet files? : r/dataengineering - Reddit
    Hi - I could not find the answer to this - but can dbt base its data warehouse around parquet files on local...
  • Parquet Files ETL | Open-source Data Integration - Airbyte
    The Airbyte Parquet Files ELT data integration connector will replicate your Parquet Files to your data warehouse, data lake or database.
  • jwills/dbt-duckdb - GitHub
    It is crazy fast and allows you to read and write data stored in CSV and Parquet files directly, without requiring you to...
