
Question regarding snappy vs pyarrow snappy

See original GitHub issue

Hi, for the same dataset, comparing the two libraries I get very different Parquet file sizes: the file produced by fastparquet is about 4x larger than the one produced by pyarrow.

I tried changing the value of row_group_offsets, and 4x was the best I could get. The pyarrow default row-group size is 128 MB.

What can I attribute this vastly different file size to? How can I reduce it?
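
As a rough illustration of the comparison described in the question, here is a minimal sketch that writes the same DataFrame with both libraries and compares the resulting file sizes. The sample DataFrame, file names, and row-group setting are made up for illustration, and both writers use snappy compression.

```python
import os

import numpy as np
import pandas as pd
import fastparquet
import pyarrow as pa
import pyarrow.parquet as pq

# Sample data with repetitive values, where encoding choices matter most.
df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.randn(1_000_000),
})

# fastparquet: row_group_offsets controls how rows are split into row groups.
# (Requires the snappy codec to be installed.)
fastparquet.write("fp.parquet", df, compression="SNAPPY",
                  row_group_offsets=500_000)

# pyarrow: write_table also accepts row_group_size to control row-group length.
pq.write_table(pa.Table.from_pandas(df), "pa.parquet", compression="snappy")

print("fastparquet:", os.path.getsize("fp.parquet"), "bytes")
print("pyarrow:    ", os.path.getsize("pa.parquet"), "bytes")
```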

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
xhochy commented on Nov 5, 2017

Dictionary encoding is one of the ways Parquet stores data. It is simply a more efficient approach when the values in a column are not unique in every row. As far as I understand, fastparquet currently does not support writing it, or does not have it enabled by default (@martindurant, correct me if I'm wrong). The only implication should be different file sizes. Reading Parquet files with dictionary encoding is supported by all standard-compliant Parquet readers and should give you the same performance; often it even gives you significantly better performance, since the file is much smaller and file size is the limiting factor for read speed when reading, e.g., over a network.
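
To see the effect described here in practice, pyarrow exposes a use_dictionary option on write_table. The sketch below toggles it on and off and compares file sizes; the sample table and file names are placeholders.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# A column with heavily repeated values benefits most from dictionary encoding.
table = pa.table({"category": ["a", "b", "c"] * 1_000_000})

# Dictionary encoding enabled (pyarrow's default).
pq.write_table(table, "dict_on.parquet", compression="snappy",
               use_dictionary=True)

# Dictionary encoding disabled: repeated values are stored plainly.
pq.write_table(table, "dict_off.parquet", compression="snappy",
               use_dictionary=False)

print("with dictionary:   ", os.path.getsize("dict_on.parquet"), "bytes")
print("without dictionary:", os.path.getsize("dict_off.parquet"), "bytes")
```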

0 reactions
wesm commented on Nov 6, 2017

Note that other Parquet implementations (parquet-cpp, parquet-mr, Impala, etc.) always dictionary-encode all column types by default.
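
If you want to check which encodings a given Parquet file actually uses, the column-chunk metadata reports them. A minimal sketch with pyarrow, assuming a placeholder path "example.parquet":

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # A dictionary-encoded column lists e.g. PLAIN_DICTIONARY or
        # RLE_DICTIONARY among its encodings.
        print(rg, chunk.path_in_schema, chunk.encodings)
```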

Read more comments on GitHub >

Top Results From Across the Web

  • Python error using pyarrow - ArrowNotImplementedError ...
    The idea is to write a pandas DataFrame as a Parquet Dataset (on Windows) using Snappy compression, and later to process the Parquet...
  • Python and Parquet Performance - Data Syndrome
    In Pandas, PyArrow, fastparquet, AWS Data Wrangler, PySpark and Dask ... Snappy compression is needed if you want to append data.
  • Write Parquet file to disk — write_parquet ... - Apache Arrow
    The compression argument can be any of the following (case insensitive): "uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4", "lzo" or "bz2". Only " ...
  • pyarrow.parquet.write_table — Apache Arrow v10.0.1
    Specify the compression codec, either on a general basis or per-column. Valid values: {'NONE', 'SNAPPY', 'GZIP', 'BROTLI', 'LZ4', 'ZSTD'}.
  • DuckDB quacks Arrow: A zero-copy data integration between ...
    The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like...
