Question regarding fastparquet snappy vs pyarrow snappy
See original GitHub issue

Hi,
For the same dataset, comparing the two libraries I get totally different Parquet file sizes: the file produced by fastparquet is about 4x larger than the one produced by pyarrow. I tried changing the values of row_group_offsets (the pyarrow default is 128 MB), and 4x was the best I could get.

To what can I attribute this vastly different file size? How can I reduce it?
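A minimal sketch of the kind of comparison being described (the DataFrame, file names, and the row_group_offsets value are made-up placeholders, not from the original report):

```python
import os

import numpy as np
import pandas as pd
import fastparquet
import pyarrow as pa
import pyarrow.parquet as pq

# A column with many repeated values plus a numeric column.
df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.rand(1_000_000),
})

# fastparquet write with snappy compression; row_group_offsets controls
# how many rows go into each row group.
fastparquet.write("fp.parquet", df, compression="SNAPPY",
                  row_group_offsets=500_000)

# pyarrow write with snappy compression (its default codec).
pq.write_table(pa.Table.from_pandas(df), "pa.parquet", compression="snappy")

print("fastparquet:", os.path.getsize("fp.parquet"), "bytes")
print("pyarrow:    ", os.path.getsize("pa.parquet"), "bytes")
```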
Issue Analytics

- Created: 6 years ago
- Comments: 8 (3 by maintainers)
Top Results From Across the Web

Python error using pyarrow - ArrowNotImplementedError ...
The idea is to write a pandas DataFrame as a Parquet Dataset (on Windows) using Snappy compression, and later to process the Parquet...

Python and Parquet Performance - Data Syndrome
In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask ... Snappy compression is needed if you want to append data.

Write Parquet file to disk — write_parquet ... - Apache Arrow
The compression argument can be any of the following (case insensitive): "uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4", "lzo" or "bz2". Only " ...

pyarrow.parquet.write_table — Apache Arrow v10.0.1
Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.

DuckDB quacks Arrow: A zero-copy data integration between ...
The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like...
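The two Arrow documentation results above describe the compression argument; a minimal pyarrow sketch (DataFrame and file names are made up) comparing a few of the listed codecs on the same table:

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One million repetitive integers, so compression has something to work with.
table = pa.Table.from_pandas(pd.DataFrame({"x": list(range(100_000)) * 10}))

for codec in ["none", "snappy", "gzip", "zstd"]:
    path = f"data_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>8}: {os.path.getsize(path)} bytes")
```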
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Dictionary encoding is a way to store data in Parquet. It is simply a more efficient approach when the values in a column are not all unique. As far as I understand, fastparquet currently does not support writing it, or does not have it enabled by default (@martindurant, correct me if I’m wrong). The implication is simply different file sizes. Reading Parquet files with dictionary encoding is supported by all standard-compliant Parquet readers and should give you the same performance; often it even gives significantly better performance, since the file is much smaller and file size is the read-speed limitation when reading e.g. over a network. Note that other Parquet implementations (parquet-cpp, parquet-mr, Impala, etc.) always dictionary-encode all column types by default.
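To make the size effect concrete, here is a minimal pyarrow sketch (the DataFrame and file names are made up) that toggles the use_dictionary option of pyarrow.parquet.write_table and inspects which encodings were actually written:

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Highly repetitive column: the ideal case for dictionary encoding.
table = pa.Table.from_pandas(
    pd.DataFrame({"city": ["amsterdam", "berlin", "cairo"] * 300_000})
)

pq.write_table(table, "dict_on.parquet", use_dictionary=True)   # pyarrow default
pq.write_table(table, "dict_off.parquet", use_dictionary=False)

print("dictionary on: ", os.path.getsize("dict_on.parquet"), "bytes")
print("dictionary off:", os.path.getsize("dict_off.parquet"), "bytes")

# The column-chunk metadata records which encodings were used.
meta = pq.ParquetFile("dict_on.parquet").metadata
print(meta.row_group(0).column(0).encodings)
```

The size gap between the two files gives a rough idea of how much of the fastparquet-vs-pyarrow difference can be attributed to dictionary encoding alone, independent of the snappy compression setting.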