Inconsistencies and errors between different parquet engines when reading Spark-generated parquet files.
Creation of two partitioned parquet files using Spark:
import pandas as pd

data = [
    [0, 1, 2],
    [0, 2, 3]
]
(
    spark  # an existing SparkSession, e.g. from a pyspark shell
    .createDataFrame(pd.DataFrame(data, columns=['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test.parquet')
)

data = [
    [0, 1, 2],
    [0, 1, 3]
]
(
    spark
    .createDataFrame(pd.DataFrame(data, columns=['col1', 'col2', 'col3']))
    .write.mode('overwrite').partitionBy('col1', 'col2')
    .parquet('test2.parquet')
)
Content of the parquet files:
test.parquet
├── col1=0
│ ├── col2=1
│ │ └── part-00000-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
│ └── col2=2
│ └── part-00001-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet
└── _SUCCESS
test2.parquet
├── col1=0
│ └── col2=1
│ ├── part-00000-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
│ └── part-00001-c021737d-18be-4dde-a721-842afe197a42.c000.snappy.parquet
└── _SUCCESS
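Note that the partition values live only in the directory names; the part files themselves should contain just col3. A minimal pyarrow check of this (using the part-file path from the listing above; any of the part files behaves the same way):

import pyarrow.parquet as pq

# The hive-style path encodes col1/col2; the file itself only stores col3
path = 'test.parquet/col1=0/col2=1/part-00000-0787dc35-33b6-46a8-8fc9-a88b443877d3.c000.snappy.parquet'
print(pq.read_schema(path).names)  # expected: ['col3']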
Experiment 1 (pyarrow-legacy engine)
>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow-legacy')
>>> print(ddf)
Dask DataFrame Structure:
col3 col1 col2
npartitions=2
int64 category[known] category[known]
... ... ...
... ... ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
col3 col1 col2
0 2 0 1
0 3 0 2
>>> ddf2 = dd.read_parquet('test2.parquet', engine='pyarrow-legacy')
>>> print(ddf2.head(npartitions=-1))
col3 col1 col2
0 2 0 1
0 3 0 1
Everything works as expected.
Experiment 2 (pyarrow engine)
>>> ddf = dd.read_parquet('test.parquet', engine='pyarrow')  # same result with the pyarrow-dataset engine
>>> print(ddf)
Dask DataFrame Structure:
col3 col1 col2
npartitions=2
int64 category[known] category[known]
... ... ...
... ... ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
Exception: KeyError("['col1', 'col2'] not in index")
So something goes wrong with the hierarchy-inferred (partition) columns. The same exception occurs with test2.parquet.
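As a cross-check (a minimal sketch, assuming a reasonably recent pyarrow), reading the same directory with pyarrow's dataset API directly reconstructs col1 and col2 from the hive-style paths, which suggests the files themselves are fine and the problem is on the Dask side:

import pyarrow.dataset as ds

# Discover the hive-partitioned dataset; files prefixed with '_' (e.g. _SUCCESS)
# are skipped by default, and col1/col2 are inferred from the directory names
dataset = ds.dataset('test.parquet', format='parquet', partitioning='hive')
print(dataset.to_table().to_pandas())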
Experiment 3 (fastparquet engine)
>>> dd.read_parquet('test.parquet', engine='fastparquet')
Exception: OSError: [Errno 22] Invalid argument
Reason: the fastparquet engine fails while trying to parse the empty _SUCCESS marker file as a parquet file, so the workaround is to delete that file first.
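A scripted equivalent of that deletion (a minimal sketch, assuming the dataset sits on a local filesystem):

import os

# Spark's empty _SUCCESS marker is not a parquet file, so drop it before reading
success_marker = os.path.join('test.parquet', '_SUCCESS')
if os.path.exists(success_marker):
    os.remove(success_marker)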
OK, after removing the file (rm test.parquet/_SUCCESS) and starting again:
>>> ddf = dd.read_parquet('test.parquet', engine='fastparquet')
>>> print(ddf)
Dask DataFrame Structure:
col3 col2
npartitions=2
int64 category[known]
... ...
... ...
Dask Name: read-parquet, 2 tasks
>>> print(ddf.head(npartitions=-1))
col3 col2
0 2 0
0 3 0
>>> ddf2 = dd.read_parquet('test2.parquet', engine='fastparquet')
>>> print(ddf2)
Dask DataFrame Structure:
col3
npartitions=2
int64
...
...
Dask Name: read-parquet, 2 tasks
>>> print(ddf2.head(npartitions=-1))
col3
0 2
0 3
Note two big problems:
- If a hierarchy-inferred column has only one unique value, that column is simply dropped.
- If a hierarchy-inferred column has more than one unique value, all of its values turn into zeros.
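For comparison, reading the same datasets through pandas shows the values the partition columns should carry (a minimal sketch, assuming pandas with pyarrow installed, which handles hive-partitioned directories):

import pandas as pd

# Reference output: partition columns reconstructed from the directory names
print(pd.read_parquet('test.parquet', engine='pyarrow'))   # col2 should be 1 and 2, not 0
print(pd.read_parquet('test2.parquet', engine='pyarrow'))  # col1/col2 should be present, not dropped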
Environment:
- Dask version: 2021.8.1
- Python version: 3.7
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): conda
- Spark: 2.4.8
Note that #8072 was merged, and so the pyarrow component of this issue should be resolved.
I spent some time thinking through the “correct” solution to this, and it is a bit tricky to come up with a convention that will work for both fastparquet and pyarrow. Therefore, I can add a fix to #8092, but I’d prefer to get that merged first, and then work on a fix.