
Inconsistency and errors between different parquet engines on Spark-generated parquet files.


Creation of two partitioned parquet files using Spark:

import pandas as pd
from pyspark.sql import SparkSession

# In a pyspark shell or spark-submit job, `spark` already exists; otherwise:
spark = SparkSession.builder.getOrCreate()

data = [
    [0, 1, 2],
    [0, 2, 3]
]

(
    spark
    .createDataFrame(pd.DataFrame(data, columns = ['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test.parquet')
)

data = [
    [0, 1, 2],
    [0, 1, 3]
]

(
    spark
    .createDataFrame(pd.DataFrame(data, columns = ['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test2.parquet')
)

Content of the parquet files:

test.parquet
├── col1=0
│   ├── col2=1
│   │   └── part-00000-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
│   └── col2=2
│       └── part-00001-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
└── _SUCCESS
test2.parquet
├── col1=0
│   └── col2=1
│       ├── part-00000-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
│       └── part-00001-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
└── _SUCCESS
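
As a quick sanity check (run in the same PySpark session as above), Spark itself reads both directories back with the partition columns reconstructed from the directory names:

# Spark rebuilds col1/col2 from the col1=.../col2=... directory names
spark.read.parquet('test.parquet').show()
spark.read.parquet('test2.parquet').show()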

Experiment 1 (pyarrow-legacy engine)

>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow-legacy') 
>>> print(ddf)
Dask DataFrame Structure:
                col3             col1             col2
npartitions=2                                         
               int64  category[known]  category[known]
                 ...              ...              ...
                 ...              ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
   col3 col1 col2
0     2    0    1
0     3    0    2

>>> ddf2 = dd.read_parquet('test2.parquet', engine='pyarrow-legacy')
>>> print(ddf2.head(npartitions=-1))
   col3 col1 col2
0     2    0    1
0     3    0    1

All works as expected.

Experiment 2 (pyarrow engine)

>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow')  ## the same result with pyarrow-dataset engine
>>> print(ddf)
Dask DataFrame Structure:
                col3             col1             col2
npartitions=2                                         
               int64  category[known]  category[known]
                 ...              ...              ...
                 ...              ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
Exception: KeyError("['col1', 'col2'] not in index")

So something goes wrong with the hierarchy-inferred (partition) columns. The same exception occurs with test2.parquet.
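
A cross-check outside Dask is to read the same directory with pyarrow's datasets API directly (a sketch, assuming a pyarrow version that ships pyarrow.dataset, which skips underscore-prefixed files such as _SUCCESS by default):

import pyarrow.dataset as ds

# Discover the hive-partitioned layout and materialize it;
# col1/col2 should come back as partition columns.
dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
print(dataset.to_table().to_pandas())

If this succeeds, the KeyError is more likely coming from Dask's pyarrow engine than from the files themselves (consistent with the fix referenced in the comments below).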

Experiment 3 (fastparquet engine)

>>> dd.read_parquet('test.parquet', engine='fastparquet')
Exception: OSError: [Errno 22] Invalid argument

Reason: the fastparquet engine fails while trying to parse the empty _SUCCESS marker file as a parquet file.
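
A stdlib-only way to strip that marker before handing the directories to fastparquet (a sketch; paths as in the example above):

import os

# Spark writes an empty _SUCCESS marker next to the data files;
# remove it so fastparquet does not try to parse it as parquet.
for root in ('test.parquet', 'test2.parquet'):
    marker = os.path.join(root, '_SUCCESS')
    if os.path.exists(marker):
        os.remove(marker)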

OK, after removing the file (rm test.parquet/_SUCCESS), start again:

>>> ddf = dd.read_parquet('test.parquet', engine='fastparquet')
>>> print(ddf)
Dask DataFrame Structure:
                col3             col2
npartitions=2                        
               int64  category[known]
                 ...              ...
                 ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
   col3 col2
0     2    0
0     3    0

>>> ddf2 = dd.read_parquet('test2.parquet', engine='fastparquet')
>>> print(ddf2)
Dask DataFrame Structure:
                col3
npartitions=2       
               int64
                 ...
                 ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf2.head(npartitions=-1))
   col3
0     2
0     3

Note two big problems:

  1. If a hierarchy-inferred column has only one unique value, that column is simply dropped.
  2. If a hierarchy-inferred column has more than one unique value, all of its values turn into zeros.
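
For comparison, reading the same directories with plain pandas (which delegates to pyarrow here) is expected to recover both the partition columns and their values; a sketch, assuming pyarrow is installed:

import pandas as pd

# Expected: col1 and col2 reconstructed from the directory names,
# including the values that fastparquet dropped or turned into zeros above.
print(pd.read_parquet('test.parquet', engine='pyarrow'))
print(pd.read_parquet('test2.parquet', engine='pyarrow'))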

Environment:

  • Dask version: 2021.8.1
  • Python version: 3.7
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda
  • Spark: 2.4.8


Top GitHub Comments

1 reaction
rjzamora commented, Oct 4, 2021

Note that #8072 was merged, and so the pyarrow component of this issue should be resolved.

0 reactions
rjzamora commented, Oct 7, 2021

Will the fastparquet component be included in #8092? Or should that be a separate line of work?

I spent some time thinking through the “correct” solution to this, and it is a bit tricky to come up with a convention that will work for both fastparquet and pyarrow. Therefore, I can add a fix to #8092, but I’d prefer to get that merged first, and then work on a fix.
