Parquet materialization
👋 Hey!
Something I’ve been thinking about is using parquet as a materialization. To be specific, one where dbt-duckdb would use the underlying duckdb connection as a shim to read and write parquet files rather than adding tables into the .duckdb file.
I’m not sure if it’s possible to override what ref
does in an adapter, but this would roughly need to do two things:
- Make a new materialization that, for parquet-materialized models, takes the model's query and executes copy (select ...) to 'model_name.parquet' (format parquet) rather than create table model_name as (select ...) -- see the sketch after this list.
- Update ref to resolve to read_parquet(ref'd model name) rather than schema.table_name when the model being referenced was materialized as parquet.
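As a rough sketch of what that could look like, here is the idea expressed directly against the DuckDB Python client. The model names (stg_events, my_model) and the flat file layout are made up for illustration; this is not dbt-duckdb code.

```python
import duckdb

con = duckdb.connect()  # in-memory connection standing in for dbt-duckdb's connection

# Stand-in for an upstream model that was already materialized as parquet.
con.execute("""
    copy (select * from (values (1, 'a'), (2, 'b')) t(id, label))
    to 'stg_events.parquet' (format parquet)
""")

# Today, a table materialization runs roughly:
#   create table my_model as (select ... from stg_events)
# The proposed parquet materialization would instead run:
con.execute("""
    copy (select id, label from read_parquet('stg_events.parquet'))
    to 'my_model.parquet' (format parquet)
""")

# ...and ref('my_model') downstream would compile to read_parquet('my_model.parquet')
# rather than schema.my_model:
print(con.execute("select count(*) from read_parquet('my_model.parquet')").fetchone())
```

The point of the sketch: the .duckdb file (or in-memory database) never holds the model's data; the parquet files on disk are the materialized output.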
Would love to hear what other people think / if this would be useful!
This is loosely related to: https://github.com/jwills/dbt-duckdb/issues/15
Top GitHub Comments
Let me answer that slightly differently. Why use parquet vs duckdb files? Imo the answer today is twofold: 1) parquet is more broadly compatible, and 2) the duckdb file format is unstable.
Both of these problems are transitory. The momentum behind duckdb will only increase adoption, allowing duckdb files to be used directly, and duckdb will have a stable file format soon enough. Furthermore, I believe that the duckdb team is motivated to make the file format as fast to query as possible -- that is to say, to make different tradeoffs from the parquet format.
So while I personally use that workflow, it's hard for me to say how broadly applicable it is, because I'm doing it as something that works now but I expect better patterns in the near future. In fact, what you all are doing now may change the workflow. Hard to say until I actually use it.
@jpmmcneill many good questions, let me try to hit a couple of them here.
On why dbt would fail -- I think that the only required feature from your PR that is missing is an implementation of list_relations_without_caching on the adapter (which is usually implemented via queries to the information_schema), which I put into my attempt to make this work here: https://github.com/jwills/dbt-duckdb/compare/jwills_file_based_dbs?expand=1
I think it's worthwhile to compare your attempt and mine to illustrate what makes me uncomfortable going down the road I illustrate in the PR: we're essentially inventing our own database catalog system in dbt-duckdb in order to support this feature. It is admittedly a very simple database catalog (relations are parquet/csv files on disk, databases/schemas are directories -- although you and I differ on how we support schemas in our approaches), which is why I get why it's tempting to do it, but I think that going down this road is going to take us to a bad place where we're essentially reinventing a database catalog in python using a tool (dbt) that was never meant to do such a thing. The right place for the database catalog to live is in the database, which is why I like @tomsej's solution here with views (an entry in the DuckDB catalog) backed by files on disk or object storage. If you think about this for a bit and you still think I'm wrong, then I say let's go off and create a dbt-parquet project that uses the DuckDB engine but implements the database catalog purely as files on disk with whatever layout you like, without having the constraint of maintaining compatibility with whatever future features the DuckDB folks come up with. TBH I'm sure someone is going to do this at some point anyway, so it might as well be someone who I trust to do it well. 😉
Second question: the metadata is required for the list_relations_without_caching method and for dbt docs generation (column names/types).
Third question, re: in-memory dbs: I agree with you, I don't think that a pure in-memory duckdb instance for a dbt run is a good idea. But to be clear, we're not requiring anyone to run stuff in memory in order to use the parquet/csv materialization types; it's just something that a user can do if they are extremely confident in their SQL abilities and the quality of their input data.
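For contrast, here is a rough sketch of the view-backed approach described above (the file, view, and database names are illustrative): the model's data still lives in a parquet file, but the DuckDB catalog records a view over it, so relation listing stays an ordinary information_schema query.

```python
import duckdb

con = duckdb.connect("dev.duckdb")  # persistent catalog file; name assumed for the example

# Write the model's output to external storage...
con.execute("copy (select 42 as answer) to 'my_model.parquet' (format parquet)")

# ...but register it in the database catalog as a view backed by that file.
con.execute("""
    create or replace view my_model as
    select * from read_parquet('my_model.parquet')
""")

# Something like list_relations_without_caching can then be a plain metadata query,
# with no file-system walking or hand-rolled catalog in python.
print(con.execute("""
    select table_schema, table_name, table_type
    from information_schema.tables
    where table_schema = 'main'
""").fetchall())
```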
Re: in memory connections: I don’t totally understand the issue you’re raising, but in dbt-duckdb there is a single parent connection and all of the threads create their own child connection from it, which happens here: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/connections.py#L109
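For readers unfamiliar with that pattern, a stripped-down illustration (not the actual dbt-duckdb code): a single parent connection, with each worker thread cloning its own child connection via cursor().

```python
import duckdb
from concurrent.futures import ThreadPoolExecutor

parent = duckdb.connect(":memory:")  # the single parent connection
parent.execute("create table t as select * from range(100)")

def run_model(_):
    child = parent.cursor()  # per-thread child connection cloned from the parent
    try:
        return child.execute("select count(*) from t").fetchone()[0]
    finally:
        child.close()

# Each dbt thread would get its own child connection; here four threads share one parent.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(run_model, range(4))))
```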