
Question regarding snappy vs pyarrow snappy

See original GitHub issue

Hi, for the same dataset, comparing the two libraries I get very different Parquet file sizes: the file produced by fastparquet is about 4x larger than the one produced by pyarrow.

I tried changing the value of row_group_offsets, and 4x was the best I could get. The pyarrow default row-group size is 128 MB.

What can I attribute this vastly different file size to? How can I reduce it?
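
As a rough illustration of the comparison described in the question, here is a minimal sketch that writes the same DataFrame with both libraries and compares the resulting file sizes. The sample DataFrame, file names, and row-group setting are made up for illustration, and both writers use snappy compression.

```python
import os

import numpy as np
import pandas as pd
import fastparquet
import pyarrow as pa
import pyarrow.parquet as pq

# Sample data with repetitive values, where encoding choices matter most.
df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.randn(1_000_000),
})

# fastparquet: row_group_offsets controls how rows are split into row groups.
# (Requires the snappy codec to be installed.)
fastparquet.write("fp.parquet", df, compression="SNAPPY",
                  row_group_offsets=500_000)

# pyarrow: write_table also accepts row_group_size to control row-group length.
pq.write_table(pa.Table.from_pandas(df), "pa.parquet", compression="snappy")

print("fastparquet:", os.path.getsize("fp.parquet"), "bytes")
print("pyarrow:    ", os.path.getsize("pa.parquet"), "bytes")
```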

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
xhochy commented on Nov 5, 2017

Dictionary encoding is one of the ways Parquet stores data. It is simply a more efficient approach when the values in a column are not unique in every row. As far as I understand, fastparquet currently does not support writing it, or does not have it enabled by default (@martindurant, correct me if I'm wrong). The only implication should be different file sizes. Reading Parquet files with dictionary encoding is supported by all standard-compliant Parquet readers and should give you the same performance; often it even gives you significantly better performance, since the file is much smaller and file size is the limiting factor for read speed when reading, e.g., over a network.
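
To see the effect described here in practice, pyarrow exposes a use_dictionary option on write_table. The sketch below toggles it on and off and compares file sizes; the sample table and file names are placeholders.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# A column with heavily repeated values benefits most from dictionary encoding.
table = pa.table({"category": ["a", "b", "c"] * 1_000_000})

# Dictionary encoding enabled (pyarrow's default).
pq.write_table(table, "dict_on.parquet", compression="snappy",
               use_dictionary=True)

# Dictionary encoding disabled: repeated values are stored plainly.
pq.write_table(table, "dict_off.parquet", compression="snappy",
               use_dictionary=False)

print("with dictionary:   ", os.path.getsize("dict_on.parquet"), "bytes")
print("without dictionary:", os.path.getsize("dict_off.parquet"), "bytes")
```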

0 reactions
wesm commented on Nov 6, 2017

Note that other Parquet implementations (parquet-cpp, parquet-mr, Impala, etc.) always dictionary-encode all column types by default.
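
If you want to check which encodings a given Parquet file actually uses, the column-chunk metadata reports them. A minimal sketch with pyarrow, assuming a placeholder path "example.parquet":

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # A dictionary-encoded column lists e.g. PLAIN_DICTIONARY or
        # RLE_DICTIONARY among its encodings.
        print(rg, chunk.path_in_schema, chunk.encodings)
```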

Read more comments on GitHub >

Top Results From Across the Web

  • Python error using pyarrow - ArrowNotImplementedError ...
    The idea is to write a pandas DataFrame as a Parquet Dataset (on Windows) using Snappy compression, and later to process the Parquet...
  • Python and Parquet Performance - Data Syndrome
    In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask ... Snappy compression is needed if you want to append data.
  • Write Parquet file to disk — write_parquet ... - Apache Arrow
    The compression argument can be any of the following (case insensitive): "uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4", "lzo" or "bz2". Only " ...
  • pyarrow.parquet.write_table — Apache Arrow v10.0.1
    Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
  • DuckDB quacks Arrow: A zero-copy data integration between ...
    The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like...
