Arrays with nulls in them result in broken parquet files
Writing a dataframe with an array column, where one of the arrays contains a null,
causes Hudi to write a broken Parquet file.
To Reproduce
Steps to reproduce (using pyspark here):
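The steps below assume a pyspark session that has the Hudi bundle on its classpath. A minimal sketch of such a session setup (the bundle coordinates are an assumption for Hudi 0.6.0 built for Scala 2.11, as on Spark 2.4, and may need adjusting for your environment):

from pyspark.sql import SparkSession

# Minimal session setup: Hudi bundle on the classpath plus the Kryo serializer
# recommended by the Hudi quickstart. The package coordinates below are an
# assumption (Hudi 0.6.0 / Scala 2.11) and may need adjusting.
spark = (
    SparkSession.builder
    .appName("hudi-array-null-repro")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark-bundle_2.11:0.6.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)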
1. Create a Spark dataframe where one column is an array, e.g.:
spark_df = spark.createDataFrame([
    (1, '2020/10/29', ['NY123', 'LA456']),
    (2, '2020/10/29', []),
    (3, '2020/10/29', None),
    (4, '2020/10/29', [None]),  # this row will break things, as would [None, None] or [None, 'ABC']
    (5, '2020/10/29', ['ABC123']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
2. Write it as Hudi with Spark:
hudi_options = {
    'hoodie.table.type': 'COPY_ON_WRITE',
    'hoodie.table.name': 'data',
    'hoodie.datasource.write.recordkey.field': 'hudi_key',
    'hoodie.datasource.write.precombine.field': 'hudi_partition',
    'hoodie.datasource.write.partitionpath.field': 'hudi_partition',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.consistency.check.enabled': True,
}
spark_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(hudi_s3_prefix)
3. Also write the same dataframe as plain Parquet for comparison:
spark_df.write.parquet(parquet_s3_prefix, mode='overwrite')
4. Hudi uses Parquet as its underlying format, so find the Parquet file it wrote on S3 and keep a reference to it; do the same for the plain Parquet file.
5. Reading the Parquet file written by Hudi (identified in step 4):
- with Spark (spark.read.schema(spark_df.schema).parquet(...)) some records are missing; which ones seems to be nondeterministic and it can be all of them, e.g. only 3 out of 5 rows are returned
- with pyarrow (and fastparquet) it fails with an error complaining about the number of values in a column, e.g. "ArrowInvalid: Column 7 named postcodes expected length 5 but got length 3"; the number tallies with how many records Spark saw (see the pyarrow sketch after this list)
6. Load the data with Hudi:
spark.read.format('hudi').schema(spark_df.schema).load(f"{hudi_s3_prefix}/*/*/*")  # (globbing over year/month/day partitions in this example)
The dataframe returned will be empty 😦
7. The Parquet file written in step 3 (plain Parquet, no Hudi) works fine.
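To make the pyarrow check in step 5 concrete, a minimal sketch (the file path is a placeholder for whichever data file Hudi wrote under the partition directory):

import pyarrow.parquet as pq

# Placeholder path: pick the Parquet file Hudi wrote for this commit.
hudi_parquet_file = "/tmp/hudi_table/2020/10/29/some-file.parquet"

try:
    table = pq.read_table(hudi_parquet_file)
    print(table.num_rows)
except Exception as err:
    # With a row like [None] in the array column this surfaces the error quoted above,
    # e.g. "ArrowInvalid: Column 7 named postcodes expected length 5 but got length 3"
    print(err)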
Bonus problems
Let’s modify the dataframe so it no longer contains the [None] for one of the rows:
spark_df = spark.createDataFrame([
    (1, '2020/10/29', ['NY123', 'LA456']),
    (2, '2020/10/29', []),
    (3, '2020/10/29', None),
    # (4, '2020/10/29', [None]),
    (5, '2020/10/29', ['ABC123']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
When we repeat the steps above and save with Hudi (overwriting) to the same location, we can read the data back correctly with both spark.read.format('hudi').load(...) and spark.read.parquet(...), but pyarrow will return data with the array column mangled - missing rows, values moved around between rows… e.g.
hudi_key hudi_partition postcodes
2 2020/10/29 []
5 2020/10/29 [None]
1 2020/10/29 [ABC123, NY123] # note that ABC123 was a postcode from hudi_key=5 which now has no postcode!
3 2020/10/29 None
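For reference, a sketch of the kind of pyarrow read that shows the mangled column above (the path is again a placeholder):

import pyarrow.parquet as pq

# Placeholder path: the Parquet file Hudi wrote for this batch.
hudi_parquet_file = "/tmp/hudi_table/2020/10/29/some-file.parquet"

# Without the [None] row the read no longer raises, but the postcodes column
# comes back truncated/shifted between rows, as in the table above.
pdf = pq.read_table(hudi_parquet_file).to_pandas()
print(pdf[["hudi_key", "hudi_partition", "postcodes"]])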
Expected behavior
It should be possible to write a dataframe with Hudi where one column is an array type and some of those arrays have null(s) in them.
Environment Description
- AWS EMR version: 5.30.1
- Hudi version: 0.5.2 and 0.6.0
- Spark version: 2.4.5
- Hive version: 2.3.6
- Hadoop version: 2.8.5
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): no
Additional context
Overwrite vs append
If we start with some valid Hudi data and then use mode('append') to add more rows whose arrays contain nulls, the new data will not be added, but the old data stays and remains readable. Adding "good" (no nulls) data later also works. The Parquet file for the "bad" batch is broken and cannot be read correctly, similarly to what was described above.
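A sketch of that append scenario, reusing hudi_options and hudi_s3_prefix from the steps above (the batch contents are only illustrative):

# First write a batch whose arrays contain no nulls.
good_df = spark.createDataFrame([
    (1, '2020/10/29', ['NY123', 'LA456']),
    (2, '2020/10/29', ['ABC123']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
good_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(hudi_s3_prefix)

# Then append a batch with a null inside an array.
bad_df = spark.createDataFrame([
    (6, '2020/10/29', [None, 'XYZ789']),
], ['hudi_key', 'hudi_partition', 'postcodes'])
bad_df.write.format("hudi").options(**hudi_options).mode("append").save(hudi_s3_prefix)

# As described above: the rows from the "bad" batch never show up when reading the
# table back, while the earlier "good" data stays readable, and appending another
# clean batch afterwards still works.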
Top GitHub Comments
@kazdy I did a workaround by changing the schema to just string. In any case, the Hive table column can be parsed into an array or list using some UDFs.

@gtwuser As far as I remember I was using OSS Spark 3.1.2 with OSS Hudi; I also ran the same checks on EMR 6.4 at the time, with the same results.
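A minimal sketch of that string workaround, here using Spark's built-in to_json/from_json rather than custom UDFs; this assumes a Spark version where those functions handle arrays of primitives (e.g. the Spark 3.1.2 mentioned above) and reuses the column and path names from the earlier example:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Store the array column as a plain JSON string before writing with Hudi.
stringified = spark_df.withColumn("postcodes", F.to_json("postcodes"))
stringified.write.format("hudi").options(**hudi_options).mode("overwrite").save(hudi_s3_prefix)

# Parse it back into array<string> after reading the Hudi table.
restored = (
    spark.read.format("hudi").load(f"{hudi_s3_prefix}/*/*/*")
    .withColumn("postcodes", F.from_json("postcodes", ArrayType(StringType())))
)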