External materializations memory alloc issue
So I'm not exactly sure what is going on here; the error coming back from dbt is not super clear. It looks like DuckDB is running out of memory.
The error is:
/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
More context
Before external materializations, I was running my dbt project and then copying the data out to specific folders with an on-run-end macro. This allowed me to run somewhere in the neighborhood of 100,000 simulations in my project (found over here) on a VM with 8GB of RAM without seeing this issue. The default number of simulations in the project is typically 10k to keep it fast.
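For context, that pre-external-materializations setup would look roughly like the sketch below (not the project's actual code): a macro wired to on-run-end in dbt_project.yml that copies each table out with DuckDB's COPY ... TO (FORMAT PARQUET). The macro name, model names, and output folder here are all hypothetical.

```sql
-- macros/export_models.sql (hypothetical name) -- a minimal sketch of an
-- on-run-end export macro, wired up in dbt_project.yml via
-- on-run-end: "{{ export_models() }}"
{% macro export_models() %}
  {% if execute %}
    {# Hypothetical model names; the real project has many more. #}
    {% for model in ['simulation_results', 'season_summary'] %}
      {# DuckDB writes each table straight to a Parquet file on disk. #}
      {% do run_query(
        "COPY (SELECT * FROM " ~ model ~ ") TO 'output/" ~ model ~ ".parquet' (FORMAT PARQUET)"
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```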
However, when using external materializations, I have to reduce the run size to 1k in order for it to run successfully. This leads me to believe that there is a memory “leak” inside of DuckDB. If I were to speculate, DuckDB is holding both the external tables and the duckdb tables in memory, instead of dropping the duckdb tables once the file has been exported to its external storage location.
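For comparison, a model using dbt-duckdb's external materialization looks roughly like the sketch below; the model name, output location, and upstream ref are hypothetical, and the exact config keys may vary by dbt-duckdb version.

```sql
-- models/simulation_results.sql (hypothetical model) -- a minimal sketch of
-- the external materialization that replaced the on-run-end copy step.
{{ config(
    materialized='external',
    location='output/simulation_results.parquet',
    format='parquet'
) }}

select *
from {{ ref('stg_simulations') }}  -- hypothetical upstream model
```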
Issue Analytics
- Created 10 months ago
- Comments: 29 (8 by maintainers)
Top GitHub Comments
@jwills I think I'm fine with closing this, given there are four workarounds: 1) set the max memory with PRAGMA, 2) use a bigger VM, 3) materialize the problematic tables as tables and export them with a post-hook, 4) don't use external tables and instead use the duckdb database directly.

Ok, I also tried to install duckdb==0.5.2.dev2286 and memory is way better. But I still do not understand why a 16 MB Parquet file needs this amount of RAM 😕
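On workaround 1, the memory cap is a one-line DuckDB setting; a minimal sketch is below (the 4GB value is an arbitrary example, not a number suggested in the thread). In a dbt project it could be issued from an on-run-start hook.

```sql
-- Cap how much memory DuckDB may allocate for this connection.
-- 4GB is an arbitrary example value; tune it to the VM size.
PRAGMA memory_limit='4GB';
-- Recent DuckDB versions also accept the equivalent SET form:
-- SET memory_limit='4GB';
```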