Parquet `_metadata` too large to parse
I encountered the following error (the full traceback is further down):
OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached
when reading a Parquet dataset on S3 that has a large `_metadata` file. In this particular case the `_metadata` file was 676.5 MB. Based on the error message, it looks like the file is too big for pyarrow to parse.
cc @jorisvandenbossche @rjzamora
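The dataset was being read with `dask.dataframe.read_parquet` (visible in the traceback below); a minimal sketch of that kind of call, with everything beyond the path assumed rather than taken from the issue:

```python
import dask.dataframe as dd

# Rough reproduction sketch (arguments are assumptions, not the exact call
# from the issue): read a Parquet dataset on S3 whose _metadata file is
# several hundred MB. pyarrow fails while deserializing that footer.
df = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",  # s3:// prefix assumed
    engine="pyarrow",
)
```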
Full traceback:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<timed exec> in <module>
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
314 gather_statistics = True
315
--> 316 read_metadata_result = engine.read_metadata(
317 fs,
318 paths,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
540 split_row_groups,
541 gather_statistics,
--> 542 ) = cls._gather_metadata(
543 paths,
544 fs,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
1038 if fs.exists(meta_path):
1039 # Use _metadata file
-> 1040 ds = pa_ds.parquet_dataset(
1041 meta_path,
1042 filesystem=fs,
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/dataset.py in parquet_dataset(metadata_path, schema, filesystem, format, partitioning, partition_base_dir)
492 )
493
--> 494 factory = ParquetDatasetFactory(
495 metadata_path, filesystem, format, options=options)
496 return factory.finish(schema)
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetDatasetFactory.__init__()
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached
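Added context, not from the report itself: the `MaxMessageSize reached` error comes from the Thrift deserializer inside pyarrow, which caps how large a single Thrift-encoded message (here, the Parquet footer) may be, and a 676.5 MB `_metadata` file exceeds that cap. Newer pyarrow releases expose these limits as keyword arguments; a hedged workaround sketch, whose exact argument names and availability depend on the pyarrow version:

```python
import pyarrow.parquet as pq

# Workaround sketch, assuming a pyarrow version that exposes the Thrift
# limits on ParquetFile (newer releases do); treat the names and values
# here as assumptions, not as the upstream fix referenced in this issue.
footer = pq.ParquetFile(
    "timeseries.parquet/_metadata",
    thrift_string_size_limit=1_000_000_000,   # bytes
    thrift_container_size_limit=1_000_000,    # number of elements
).metadata
print(footer.num_row_groups, footer.num_columns)
```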

Thank you for tracking this down.
On Wed, Aug 18, 2021 at 8:31 AM Joris Van den Bossche wrote:
We no longer write `_metadata` files by default, and in many cases no longer load file statistics. Combined with the upstream pyarrow fix, I believe this can be closed.
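For anyone hitting this on an older stack, both sides of that resolution can also be applied explicitly. A sketch assuming a reasonably recent dask version (the keyword arguments below exist in current releases, but check yours):

```python
import dask.dataframe as dd

# Reading: skip the global _metadata file and fall back to per-file footers,
# so the oversized footer never has to be deserialized in one piece.
df = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",
    engine="pyarrow",
    ignore_metadata_file=True,
)

# Writing: don't emit a global _metadata file at all (now the dask default).
# The output path below is purely illustrative.
df.to_parquet(
    "s3://oss-shared-scratch/timeseries-copy.parquet",
    write_metadata_file=False,
)
```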