[SUPPORT] Duplicate records for the same recordKey and different partitionpath

Describe the problem you faced

I’ve created a simple script to test insert and upsert operations. When I run an upsert for an existing record key but with a different value in the partition field column, Hudi duplicates the record instead of updating it.
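To make the symptom concrete, here is a toy model in plain Python. It is not Hudi code and does not reflect Hudi's internals; it only contrasts a key lookup scoped to one partition with a lookup across all partitions. The record values (`a1`, `false`) and the helper names `upsert` and `count_by_key` are made up for illustration.

```python
# Toy model only: this is NOT Hudi code. It mimics how a partition-scoped
# key lookup and a table-wide (global) key lookup treat the same upsert.

def upsert(table, record, key_field, partition_field, global_index):
    """Upsert `record` into `table`, a dict mapping (partition, key) -> record."""
    key = record[key_field]
    partition = record[partition_field]
    if global_index:
        # A global lookup finds the key in any partition and drops the old copy,
        # so the record effectively moves to its new partition.
        for slot in [s for s in table if s[1] == key]:
            del table[slot]
    # A partition-scoped lookup only ever sees (partition, key), so a changed
    # partition value looks like a brand-new record.
    table[(partition, key)] = record
    return table


def count_by_key(table):
    """Count how many copies of each record key exist across all partitions."""
    counts = {}
    for (_, key) in table:
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical record; the real data set is not shown in the report.
original = {"uuid": "a1", "payment_type": "cash", "is_syncing": "false"}
updated = {**original, "is_syncing": "true"}

scoped = upsert({("false", "a1"): original}, updated,
                "uuid", "is_syncing", global_index=False)
global_ = upsert({("false", "a1"): original}, updated,
                 "uuid", "is_syncing", global_index=True)

print(count_by_key(scoped))   # {'a1': 2}  -> the duplicate described above
print(count_by_key(global_))  # {'a1': 1}  -> the behavior I expected
```

The first result matches what I see in the table: the same `uuid` present once per partition value.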

To Reproduce

My script

from pyspark.sql.functions import lit

df = spark.read.parquet(....)

partition_field = 'is_syncing'
database_name = 'hudi_sync'
hudi_table_path = *****

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': partition_field,
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.upsert.shuffle.parallelism': 8,
    'hoodie.insert.shuffle.parallelism': 8,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.partition_fields': partition_field,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
}

df.write.format("org.apache.hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save(hudi_table_path)

df = spark.table(f'{database_name}.{tableName}')

payment_type_cash = df \
    .filter('payment_type = "cash"')

payment_type_cash \
    .withColumn('is_syncing', lit('true')) \
    .withColumn('hudi_etl_pipeline', lit('true')) \
    .write \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode("append") \
    .save(hudi_table_path)

Expected behavior

Hudi should update the _hoodie_partition_path and is_syncing columns of the existing record, not create a new record.
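One avenue that may be worth checking: as far as I understand the Hudi docs, non-global index types only enforce record key uniqueness within a partition, while global index types enforce it across the whole table. The sketch below layers `hoodie.index.type` and `hoodie.bloom.index.update.partition.path` on top of the write options. Whether this gives the expected behavior on 0.10.0 is not confirmed in this report, and the `base_options` dict is a shortened stand-in for the `hudi_options` in the script above.

```python
# Sketch under assumptions: a global index is acceptable for this table's
# size, and these two settings behave on 0.10.0 as the docs describe.
global_index_options = {
    # Look up record keys across all partitions instead of per partition.
    'hoodie.index.type': 'GLOBAL_BLOOM',
    # When the partition value of an existing key changes, move the record
    # to the new partition rather than keeping it in the old one.
    'hoodie.bloom.index.update.partition.path': 'true',
}

# Shortened stand-in for the hudi_options dict from the script above.
base_options = {
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'is_syncing',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.operation': 'bulk_insert',
}

# Later keys win, so the upsert operation overrides bulk_insert here.
upsert_options = {
    **base_options,
    **global_index_options,
    'hoodie.datasource.write.operation': 'upsert',
}

for key, value in sorted(upsert_options.items()):
    print(f'{key} = {value}')
```

The merged dict would be passed as `.options(**upsert_options)` on the upsert write. A global index has to check keys against the whole table, so lookups can get slower as the table grows.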

Environment Description

Hudi version : 0.10.0
Spark version : 3.1.2
Hive version : AWS Glue Data Catalog
Hadoop version :
Storage (HDFS/S3/GCS…) : S3
Running on Docker? (yes/no) : Yes, EMR on EKS. I replaced the Hudi version bundled with the EMR 6.4.0 release (0.8.0) with Hudi 0.10.0.