
Inconsistency and errors between different parquet engines on Spark-generated parquet files.


Creation of two partitioned parquet files using Spark:

import pandas as pd
from pyspark.sql import SparkSession

# In a pyspark shell or spark-submit job, `spark` already exists; otherwise:
spark = SparkSession.builder.getOrCreate()

data = [
    [0, 1, 2],
    [0, 2, 3]
]

(
    spark
    .createDataFrame(pd.DataFrame(data, columns = ['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test.parquet')
)

data = [
    [0, 1, 2],
    [0, 1, 3]
]

(
    spark
    .createDataFrame(pd.DataFrame(data, columns = ['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test2.parquet')
)

Content of the parquet files:

test.parquet
├── col1=0
│   ├── col2=1
│   │   └── part-00000-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
│   └── col2=2
│       └── part-00001-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
└── _SUCCESS
test2.parquet
├── col1=0
│   └── col2=1
│       ├── part-00000-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
│       └── part-00001-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
└── _SUCCESS
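
As a quick sanity check (run in the same PySpark session as above), Spark itself reads both directories back with the partition columns reconstructed from the directory names:

# Spark rebuilds col1/col2 from the col1=.../col2=... directory names
spark.read.parquet('test.parquet').show()
spark.read.parquet('test2.parquet').show()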

Experiment 1 (pyarrow-legacy engine)

>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow-legacy') 
>>> print(ddf)
Dask DataFrame Structure:
                col3             col1             col2
npartitions=2                                         
               int64  category[known]  category[known]
                 ...              ...              ...
                 ...              ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
   col3 col1 col2
0     2    0    1
0     3    0    2

>>> ddf2 = dd.read_parquet('test2.parquet', engine='pyarrow-legacy')
>>> print(ddf2.head(npartitions=-1))
   col3 col1 col2
0     2    0    1
0     3    0    1

All works as expected.

Experiment 2 (pyarrow engine)

>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow')  ## the same result with pyarrow-dataset engine
>>> print(ddf)
Dask DataFrame Structure:
                col3             col1             col2
npartitions=2                                         
               int64  category[known]  category[known]
                 ...              ...              ...
                 ...              ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
Exception: KeyError("['col1', 'col2'] not in index")

So something goes wrong with the hierarchy-inferred (partition) columns. The same exception occurs with test2.parquet.
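
A cross-check outside Dask is to read the same directory with pyarrow's datasets API directly (a sketch, assuming a pyarrow version that ships pyarrow.dataset, which skips underscore-prefixed files such as _SUCCESS by default):

import pyarrow.dataset as ds

# Discover the hive-partitioned layout and materialize it;
# col1/col2 should come back as partition columns.
dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
print(dataset.to_table().to_pandas())

If this succeeds, the KeyError is more likely coming from Dask's pyarrow engine than from the files themselves (consistent with the fix referenced in the comments below).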

Experiment 3 (fastparquet engine)

>>> dd.read_parquet('test.parquet', engine='fastparquet')
Exception: OSError: [Errno 22] Invalid argument

Reason: the fastparquet engine fails while trying to parse the empty _SUCCESS marker file as a parquet file.
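
A stdlib-only way to strip that marker before handing the directories to fastparquet (a sketch; paths as in the example above):

import os

# Spark writes an empty _SUCCESS marker next to the data files;
# remove it so fastparquet does not try to parse it as parquet.
for root in ('test.parquet', 'test2.parquet'):
    marker = os.path.join(root, '_SUCCESS')
    if os.path.exists(marker):
        os.remove(marker)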

OK, after removing the file (rm test.parquet/_SUCCESS), start again:

>>> ddf = dd.read_parquet('test.parquet', engine='fastparquet')
>>> print(ddf)
Dask DataFrame Structure:
                col3             col2
npartitions=2                        
               int64  category[known]
                 ...              ...
                 ...              ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
   col3 col2
0     2    0
0     3    0

>>> ddf2 = dd.read_parquet('test2.parquet', engine='fastparquet')
>>> print(ddf2)
Dask DataFrame Structure:
                col3
npartitions=2       
               int64
                 ...
                 ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf2.head(npartitions=-1))
   col3
0     2
0     3

Note two big problems:

  1. If a hierarchy-inferred column has only one unique value, that column is simply dropped.
  2. If a hierarchy-inferred column has more than one unique value, all of its values turn into zeros.
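
For comparison, reading the same directories with plain pandas (which delegates to pyarrow here) is expected to recover both the partition columns and their values; a sketch, assuming pyarrow is installed:

import pandas as pd

# Expected: col1 and col2 reconstructed from the directory names,
# including the values that fastparquet dropped or turned into zeros above.
print(pd.read_parquet('test.parquet', engine='pyarrow'))
print(pd.read_parquet('test2.parquet', engine='pyarrow'))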

Environment:

  • Dask version: 2021.8.1
  • Python version: 3.7
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): conda
  • Spark: 2.4.8


Top GitHub Comments

1 reaction
rjzamora commented, Oct 4, 2021

Note that #8072 was merged, and so the pyarrow component of this issue should be resolved.

0 reactions
rjzamora commented, Oct 7, 2021

Will the fastparquet component be included in #8092? Or should that be a separate line of work?

I spent some time thinking through the “correct” solution to this, and it is a bit tricky to come up with a convention that will work for both fastparquet and pyarrow. Therefore, I can add a fix to #8092, but I’d prefer to get that merged first, and then work on a fix.
