
Parquet `_metadata` too large to parse


I encountered the following error (the full traceback is further down):

OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached

when reading a Parquet dataset on S3 that has a large _metadata file. In this particular case, the _metadata file was 676.5 MB. Based on the error message, it looks like this _metadata file is too big for pyarrow to parse.
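
For context, here is a minimal sketch of the kind of call that hits this error. The bucket path is taken from the error message above; the engine and storage options are illustrative assumptions, not the exact setup.

```python
# Sketch only: reproduces the shape of the failing read, not the exact setup.
import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",  # dataset whose _metadata file is ~676 MB
    engine="pyarrow",                              # the pyarrow engine parses _metadata via thrift
    storage_options={"anon": False},               # assumed S3 credential handling
)
```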

cc @jorisvandenbossche @rjzamora

Full traceback:
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed exec> in <module>

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    314         gather_statistics = True
    315 
--> 316     read_metadata_result = engine.read_metadata(
    317         fs,
    318         paths,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in read_metadata(cls, fs, paths, categories, index, gather_statistics, filters, split_row_groups, read_from_paths, chunksize, aggregate_files, **kwargs)
    540             split_row_groups,
    541             gather_statistics,
--> 542         ) = cls._gather_metadata(
    543             paths,
    544             fs,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py in _gather_metadata(cls, paths, fs, split_row_groups, gather_statistics, filters, index, dataset_kwargs)
   1038             if fs.exists(meta_path):
   1039                 # Use _metadata file
-> 1040                 ds = pa_ds.parquet_dataset(
   1041                     meta_path,
   1042                     filesystem=fs,

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/dataset.py in parquet_dataset(metadata_path, schema, filesystem, format, partitioning, partition_base_dir)
    492     )
    493 
--> 494     factory = ParquetDatasetFactory(
    495         metadata_path, filesystem, format, options=options)
    496     return factory.finish(schema)

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetDatasetFactory.__init__()

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/mambaforge/envs/coiled-jrbourbeau-parquet-demo/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open Parquet input source 'oss-shared-scratch/timeseries.parquet/_metadata': Couldn't deserialize thrift: MaxMessageSize reached

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 23 (18 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Aug 18, 2021

Thank you for tracking this down

On Wed, Aug 18, 2021 at 8:31 AM Joris Van den Bossche wrote:

OK, so indeed this is related to the thrift version. In my local development environment I still had libthrift 0.13, with which it works fine, as mentioned above. But the latest releases of pyarrow on conda-forge are built with libthrift 0.14. Creating a new local environment with thrift 0.14 and building pyarrow master against it, I also get the error.

It seems that Apache Thrift introduced a MaxMessageSize configuration option (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) in version 0.14 (https://issues.apache.org/jira/browse/THRIFT-5237).

Fastparquet uses the Python bindings of Thrift, and it seems those are still at version 0.13 and not yet updated to 0.14 (https://github.com/conda-forge/thrift-feedstock/). So I suppose that is why the file could be created with the fastparquet engine.

If you have an environment with Arrow built against libthrift 0.13, the file can be read fine. But on conda-forge the recent releases are built against 0.14, and you can only get pyarrow 3.0 or lower with libthrift=0.13.

I opened https://issues.apache.org/jira/browse/ARROW-13655 on the Arrow side; we could expose this configuration option there (but that's not a short-term workaround).
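
A possible short-term workaround implied by the above (a sketch, not something verified in this thread) is to read the dataset with the fastparquet engine, since it goes through the Thrift 0.13 Python bindings that do not enforce MaxMessageSize. Pinning an environment to pyarrow 3.0 or lower with libthrift 0.13, as described above, would be the other option.

```python
# Hedged workaround sketch: avoid pyarrow's thrift deserialization limit by
# letting fastparquet (Thrift 0.13 Python bindings) parse the _metadata file.
import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://oss-shared-scratch/timeseries.parquet",
    engine="fastparquet",
)
```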


0 reactions
jcrist commented, May 13, 2022

We no longer write _metadata files by default, and in many cases we no longer load file statistics. Combined with the upstream pyarrow fix, I believe this can be closed.
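
For readers hitting this today: the upstream fix referenced here exposes the thrift size limits in newer pyarrow releases. Below is a sketch of raising them when inspecting an oversized footer; the exact keyword names and their availability depend on your pyarrow version, so treat them as an assumption, and the path is a placeholder.

```python
# Sketch, assuming a pyarrow release that exposes the thrift limits added
# under ARROW-13655; the path is a placeholder for the oversized _metadata file.
import pyarrow.parquet as pq

pf = pq.ParquetFile(
    "timeseries.parquet/_metadata",
    thrift_string_size_limit=1_000_000_000,     # raise the ~100 MB default cap
    thrift_container_size_limit=1_000_000_000,
)
print(pf.metadata.num_row_groups)
```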
