[SUPPORT] Duplicate records for same recordKey and different partitionpath

Describe the problem you faced

I’ve created a simple script to test insert and upsert operations. When I run an upsert for an existing record but with a different value in the partition field column, Hudi duplicates the record instead of updating it.

To Reproduce

My script

from pyspark.sql.functions import lit

df = spark.read.parquet(....)

partition_field = 'is_syncing'
database_name = 'hudi_sync'
hudi_table_path = *****

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': partition_field,
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.upsert.shuffle.parallelism': 8,
    'hoodie.insert.shuffle.parallelism': 8,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.partition_fields': partition_field,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE'
}

df.write.format("org.apache.hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save(hudi_table_path)

df = spark.table(f'{database_name}.{tableName}')

payment_type_cash = df \
    .filter('payment_type = "cash"')

payment_type_cash \
    .withColumn('is_syncing', lit('true')) \
    .withColumn('hudi_etl_pipeline', lit('true')) \
    .write \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode("append") \
    .save(hudi_table_path)

Expected behavior

Hudi should update the _hoodie_partition_path and is_syncing columns of the existing record, not create a new record.
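
One way to picture the reported behavior is a toy model, not Hudi code. It assumes the table uses an index that looks up record keys within a single partition, so the same uuid arriving under a new is_syncing value is treated as a brand-new record. The `upsert` helper and the sample records below are hypothetical and exist only for illustration.

```python
# Toy model only: a plain dict stands in for the table's index.
# None of this is Hudi code; `upsert` is a hypothetical helper.

def upsert(table, record, global_index):
    """Upsert `record` into `table`, a dict keyed by (partition, record key)."""
    partition, key = record["is_syncing"], record["uuid"]
    if global_index:
        # Table-wide uniqueness: drop any copy stored under another partition.
        for stale in [k for k in table if k[1] == key and k[0] != partition]:
            del table[stale]
    # A partition-scoped lookup only ever sees the incoming partition.
    table[(partition, key)] = record
    return table

old = {"uuid": "a1", "is_syncing": "false", "payment_type": "cash"}
new = {"uuid": "a1", "is_syncing": "true", "payment_type": "cash"}

scoped = upsert(upsert({}, old, False), new, False)
table_wide = upsert(upsert({}, old, True), new, True)

print(len(scoped))      # 2: the same uuid now lives in both partitions
print(len(table_wide))  # 1: the record moved to the new partition
```

In this sketch the duplicate appears only because uniqueness is checked per partition; whether that matches the index actually in use on the table above is an assumption.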

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.2

  • Hive version : AWS Glue Data Catalog

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : Yes, EMR on EKS. I replaced the Hudi version bundled with the EMR 6.4.0 release (0.8.0) with Hudi 0.10.0.

Query on AWS Athena (image not preserved)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

nsivabalan commented, Jan 3, 2022 (1 reaction)

Can you try with a new table? Usually, changing the index type, partition path fields, or record key fields is not recommended for an existing table.

nsivabalan commented, Jan 3, 2022 (1 reaction)

Yes, it's a backwards-incompatible change.
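
A minimal sketch of write options for a fresh table follows. It assumes the index-type remark above refers to one of Hudi's global index variants, which the thread excerpt does not confirm. `hoodie.index.type` and `hoodie.bloom.index.update.partition.path` are documented Hudi configs, but their defaults differ between releases, so check the values against the configuration reference for 0.10.0. The names `base_options`, `global_index_options`, and `new_table_options` are illustrative.

```python
# Assumption: these options are for a NEW table, following the advice not to
# change the index type on an existing one.
base_options = {
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'is_syncing',
    'hoodie.datasource.write.precombine.field': 'updated_at',
}

# Assumption: a global index is the intended fix. It enforces record key
# uniqueness across all partitions instead of within each one.
global_index_options = {
    'hoodie.index.type': 'GLOBAL_BLOOM',
    # Move the record to the new partition when its partition value changes.
    'hoodie.bloom.index.update.partition.path': 'true',
}

# Merge into one dict that can be passed to .options(**new_table_options).
new_table_options = {**base_options, **global_index_options}
```

A global index has to compare incoming keys against the whole table rather than a single partition, so upserts are typically slower on large tables.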
