
Mismatch between parquet statistics and parts silently drops data

See original GitHub issue

What happened:

Hi,

I’ve had some difficulty reading in (older) parquet data files, because Dask does not find statistics for about 2/3 of the files.

I’m running the following simple code:

df = dask.dataframe.read_parquet(urlpath, engine='fastparquet').persist()

Where urlpath is a glob path that resolves to 5 folders, each with a _metadata and a _common_metadata file, and 17 parquet files (one of the folders has 16). The total number of partitions should be (17*4)+16 = 84, but df.npartitions returns 36; the other files are not included in df. After some effort tracing the problem through the code, I ended up putting a print here: https://github.com/dask/dask/blob/7446308083e2f284229a11e9b3e330aa9e73cf96/dask/dataframe/io/parquet/core.py#L347 which gives me:

len(meta)=0, len(statistics)=36, len(parts)=84, len(index)=15

Because of the zip() on this line: https://github.com/dask/dask/blob/7446308083e2f284229a11e9b3e330aa9e73cf96/dask/dataframe/io/parquet/core.py#L1143-L1150 only the first 36 of the parts are included in df.
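The silent truncation comes from a basic property of zip(): it stops at the shortest input. A minimal illustration (not Dask's actual code, just the same pattern with the counts from this report):

```python
# zip() stops at its shortest argument, so if only 36 of 84 parts
# have statistics, the remaining 48 parts are dropped with no error.
parts = [f"part-{i}.parquet" for i in range(84)]        # all file pieces
statistics = [{"min": i, "max": i} for i in range(36)]  # stats for a subset

kept = list(zip(parts, statistics))
print(len(kept))  # 36 — the other 48 parts vanish silently
```

This is why the mismatch produces no exception: the pairing simply ends early.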

What you expected to happen:

It is quite probable that the root problem here occurred back when this data was written (although I have no clue how to investigate that further). I can also circumvent the problem by passing in gather_statistics=False. However, I can’t see how cases where len(statistics) != len(parts) would return reliable results, and would at least expect a warning here. Does anyone have pointers on how to further investigate and/or improve Dask’s handling of this (exceptional?) case? Or an explanation on why the current behavior is desired?
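One possible shape for the warning suggested above is a small guard before the zip: if statistics exist for only some parts, warn and discard the statistics (equivalent to gather_statistics=False) rather than drop partitions. This is a hypothetical sketch, not Dask's actual code; the helper name align_parts_and_stats is invented for illustration:

```python
import warnings

def align_parts_and_stats(parts, statistics):
    """Hypothetical guard: if statistics cover only some parts, warn and
    ignore statistics entirely instead of silently truncating parts."""
    if statistics and len(statistics) != len(parts):
        warnings.warn(
            f"Statistics found for {len(statistics)} of {len(parts)} parts; "
            "ignoring statistics so that no partitions are dropped."
        )
        statistics = None  # behave as if gather_statistics=False
    return parts, statistics

parts = [f"part-{i}" for i in range(84)]
stats = [{} for _ in range(36)]
parts, stats = align_parts_and_stats(parts, stats)
print(len(parts), stats is None)  # 84 True
```

Dropping to the no-statistics path keeps all data at the cost of losing min/max filtering, which seems like the safer default when the metadata is inconsistent.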

Minimal Complete Verifiable Example:

Not possible without the data in question.

Anything else we need to know?:

/

Environment:

  • Dask version: 2021.11.1
  • Fastparquet version: 0.7.1 (currently, for reading, unclear which version was used for writing)
  • Python version: 3.9.7
  • Operating System: Ubuntu 20.04.3 LTS
  • Install method (conda, pip, source): Conda

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:12 (9 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Nov 29, 2021

You might try a glob pattern like “root_dir/**.parquet” (or whatever the data files are called).
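Globbing only the data files means the possibly inconsistent _metadata files are never consulted. A sketch of the idea (the directory layout and the ".parquet" suffix are assumptions about how the data files are named):

```python
# Build a toy layout with data files plus _metadata files, then glob
# recursively for *.parquet so only the data files are picked up.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
sub = os.path.join(root, "folder1")
os.makedirs(sub)
for name in ("part-0.parquet", "part-1.parquet", "_metadata", "_common_metadata"):
    open(os.path.join(sub, name), "w").close()

files = glob.glob(os.path.join(root, "**", "*.parquet"), recursive=True)
print(sorted(os.path.basename(f) for f in files))
# ['part-0.parquet', 'part-1.parquet'] — the _metadata files are skipped
```

Passing such a list (or the equivalent glob string) to read_parquet forces statistics to be gathered from the files themselves rather than from stale _metadata.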

> I am actually not sure if fastparquet makes the statistics optional

Yes!

    stats: True|False|list(str)
        Whether to calculate and write summary statistics. If True (default), do it for
        every column; if False, never do; and if a list of str, do it only for those
        specified columns.
1 reaction
martindurant commented, Nov 22, 2021

At a quick skim of the report, I imagine we indeed don’t test for the case where only some of the data contains statistics. The statistics should be aligned with the “parts”, so that a part is only excluded when you can be sure it contains no good data.

Read more comments on GitHub >