Parquet materialization

👋

Hey!

Something I’ve been thinking about is using parquet as a materialization. To be specific, one where dbt-duckdb would use the underlying duckdb connection as a shim to write parquet files rather than adding tables into the .duckdb file.

I’m not sure if it’s possible to override what ref does in an adapter, but this would roughly need to do two things:

  1. Make a new materialization that takes the query for a model and executes copy (select ...) to 'model_name.parquet' (format parquet) rather than create table model_name as (select ...) for parquet-materialized models.
  2. Update ref to use read_parquet(ref'd model name) rather than schema.table_name if the model being referenced was materialized as parquet (see the sketch below).
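A minimal sketch of those two pieces against a bare duckdb connection (the model and file names here are hypothetical, and this is only an illustration of the idea, not how dbt-duckdb would actually wire it up):

```python
import duckdb

con = duckdb.connect()  # in-memory connection used purely as the query engine

# Today's table materialization: the model lands inside the .duckdb catalog.
con.execute("CREATE TABLE my_model AS (SELECT 42 AS answer)")

# Proposed parquet materialization: write the query result straight to a file instead.
con.execute("COPY (SELECT 42 AS answer) TO 'my_model.parquet' (FORMAT PARQUET)")

# Proposed ref() behaviour for parquet-materialized models: downstream queries read
# from read_parquet('my_model.parquet') instead of schema.my_model.
print(con.execute("SELECT * FROM read_parquet('my_model.parquet')").fetchall())
```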

Would love to hear what other people think / if this would be useful!

This is loosely related to: https://github.com/jwills/dbt-duckdb/issues/15

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 3
  • Comments: 47 (14 by maintainers)

Top GitHub Comments

4 reactions
matsonj commented, Oct 26, 2022

Do you see the pattern of using a duckdb view / table for development and then parquet for production being a common one?

Let me answer that slightly differently: why use parquet vs duckdb files? IMO the answer today is twofold: 1) parquet is more broadly compatible, and 2) the duckdb file format is unstable.

Both of these problems are transitory. The momentum behind duckdb will only increase adoption, which will make it possible to use duckdb files directly, and duckdb will have a stable file format soon enough. Furthermore, I believe the duckdb team is motivated to make the file format as fast to query as possible - that is to say, to make different tradeoffs from the parquet format.

So while I personally use that workflow, it’s hard for me to say how broadly applicable it is, because I’m doing it as something that works now while expecting better patterns in the near future. In fact, what you all are doing now may change the workflow. Hard to say until I actually use it.

3 reactions
jwills commented, Oct 28, 2022

@jpmmcneill many good questions, let me try to hit a couple of them here.

On why dbt would fail -- I think the only required feature missing from your PR is an implementation of list_relations_without_caching on the adapter (which is usually implemented via queries to the information_schema). I put that into my attempt to make this work here: https://github.com/jwills/dbt-duckdb/compare/jwills_file_based_dbs?expand=1
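For context, list_relations_without_caching is the hook dbt calls to discover which relations already exist in a target schema. In a catalog-backed adapter it typically boils down to a query like the following (a simplified illustration of the usual approach, not the code from the linked branch; the database file name is hypothetical):

```python
import duckdb

con = duckdb.connect("dev.duckdb")  # hypothetical database file for the dbt target

# The kind of information_schema query a list_relations_without_caching
# implementation typically issues to enumerate tables and views in a schema.
relations = con.execute(
    """
    SELECT table_name, table_type
    FROM information_schema.tables
    WHERE table_schema = ?
    """,
    ["main"],
).fetchall()

print(relations)  # e.g. [('my_model', 'BASE TABLE'), ('my_view', 'VIEW')]
```

A relation that exists only as a parquet file on disk never shows up in that query, which is exactly the gap a file-based catalog has to paper over.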

I think it’s worthwhile to compare your attempt and mine to illustrate what makes me uncomfortable going down the road I illustrate in the PR: we’re essentially inventing our own database catalog system in dbt-duckdb in order to support this feature. It is admittedly a very simple database catalog (relations are parquet/csv files on disk, databases/schemas are directories -- although you and I differ on how we support schemas in our approaches), which is why I get why it’s tempting to do it. But I think that going down this road is going to take us to a bad place where we’re essentially reinventing a database catalog in python using a tool (dbt) that was never meant to do such a thing.

The right place for the database catalog to live is in the database, which is why I like @tomsej’s solution here w/views (an entry in the DuckDB catalog) backed by files on disk or object storage.

If you think about this for a bit, and you still think I’m wrong, then I say let’s go off and create a dbt-parquet project that uses the DuckDB engine but implements the database catalog purely as files on disk with whatever layout you like, without the constraint of maintaining compatibility with whatever future features the DuckDB folks come up with. TBH I’m sure someone is going to do this at some point anyway, so it might as well be someone who I trust to do it well. 😉
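As a concrete illustration of that view-backed approach (a minimal sketch assuming a hypothetical my_model parquet file; not @tomsej’s actual implementation):

```python
import duckdb

con = duckdb.connect()  # could just as well be a persistent .duckdb catalog file

# Stand-in for an externally materialized model: the data lives in parquet on disk.
con.execute("COPY (SELECT 1 AS id) TO 'my_model.parquet' (FORMAT PARQUET)")

# Register the file as a view: the catalog entry lives in DuckDB, so dbt (and
# information_schema) can see the relation, while the bytes stay in the parquet file.
con.execute("CREATE VIEW my_model AS SELECT * FROM read_parquet('my_model.parquet')")

print(con.execute("SELECT * FROM my_model").fetchall())
```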

Second question: the metadata is needed both for the required list_relations_without_caching method and for dbt docs generation (column names/types).

Third question, re: memory dbs: I agree with you, I don’t think that a pure in-memory duckdb instance for a dbt run is a good idea. But to be clear, we’re not requiring anyone to run stuff in memory in order to use the parquet/csv materialization types; it’s just something a user can do if they are extremely confident in their SQL abilities and the quality of their input data.

Re: in memory connections: I don’t totally understand the issue you’re raising, but in dbt-duckdb there is a single parent connection and all of the threads create their own child connection from it, which happens here: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/connections.py#L109
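For anyone unfamiliar with that pattern, here is a rough sketch of a single parent connection handing child connections (DuckDB cursors) to worker threads -- an illustration of the idea only, not the code at that link:

```python
import threading
import duckdb

parent = duckdb.connect(":memory:")  # one parent connection for the whole run
parent.execute("CREATE TABLE events AS SELECT range AS id FROM range(10)")

def worker(name: str) -> None:
    # Each thread derives its own child connection from the parent; a DuckDB
    # cursor is a cheap duplicate connection that shares the same database.
    child = parent.cursor()
    count = child.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    print(f"{name}: {count} rows")
    child.close()

threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```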
