[SUPPORT] CoW: Hudi Upsert not working when there is a timestamp field in the composite key

Describe the problem you faced

I run a batch process that applies updates to existing tables in our data lake. These are Hive external partitioned tables whose location points to an S3 directory. I’m working on a PoC to migrate all the tables to Hudi. I did a bulk_insert for the IDL and everything went fine. For upserts, I have a problem: my composite primary key includes a timestamp field. I have added all the required config in my code, but records are getting duplicated because the timestamp field is rendered differently in the record key during the upsert operation. Below are my Hudi options:

    hudi_options = {
        'hoodie.table.name': 'f_claim_mdcl_hudi_cow',
        'hoodie.datasource.write.recordkey.field': 'claim_id,pat_id,claim_subm_dt,plac_of_srvc_cd,src_pri_psbr_id,src_plan_id',
        'hoodie.datasource.write.partitionpath.field': 'src_sys_nm,yr_mth',
        'hoodie.datasource.write.table.Type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.table.name': 'f_hudi_cow',
        # 'hoodie.combine.before.insert': 'false',
        'hoodie.combine.before.upsert': 'true',
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.table': 'f_hudi_cow',
        'hoodie.datasource.hive_sync.partition_fields': 'src_sys_nm,yr_mth',
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.hive_sync.database': 'us_commercial_datalake_app_commons_dev',
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
        'hoodie.datasource.hive_sync.auto_create_db': 'false',
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
        'hoodie.datasource.write.row.writer.enable': 'true',
        'hoodie.parquet.small.file.limit': '600000000',
        'hoodie.parquet.max.file.size': '1000000000',
        'hoodie.upsert.shuffle.parallelism': '10000',
        'hoodie.insert.shuffle.parallelism': '10000',
        'hoodie.clean.automatic': 'false',
        'hoodie.cleaner.commits.retained': 3,
        'hoodie.index.type': 'GLOBAL_SIMPLE',
        'hoodie.simple.index.update.partition.path': 'true',
        'hoodie.metadata.enable': 'true'
    }

    (df.write.format("org.apache.hudi")
        .options(**hudi_options)
        .option('hoodie.datasource.write.operation', 'upsert')
        .mode("APPEND")
        .save("{s3_path}"))

I don’t get any errors while processing. My record key for the bulk insert looks like this:

After bulk_insert, _hoodie_record_key: claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:2020-11-21 00:00:00.0,plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999

After upsert, a second copy of the same record was added with _hoodie_record_key: claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:1605916800000000,plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999
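
The two keys differ only in how claim_subm_dt is rendered. A quick arithmetic check, sketched here in plain Python, suggests that both forms encode the same instant. It assumes the string form is in UTC and the numeric form is microseconds since the Unix epoch; neither is stated in the issue, and the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_utc(micros: int) -> datetime:
    """Interpret an integer as microseconds since the Unix epoch (UTC)."""
    return EPOCH + timedelta(microseconds=micros)


def utc_string_to_micros(text: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS.f' as UTC (assumption) and return epoch microseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(microseconds=1)


upsert_value = 1605916800000000              # claim_subm_dt in the upsert key
bulk_insert_value = "2020-11-21 00:00:00.0"  # claim_subm_dt in the bulk_insert key

print(micros_to_utc(upsert_value))                              # 2020-11-21 00:00:00+00:00
print(utc_string_to_micros(bulk_insert_value) == upsert_value)  # True
```

If the session time zone were not UTC, the two forms would differ by the zone offset, so the check would need that offset applied first.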

To Reproduce

Steps to reproduce the behavior:

Generate a set of records with a timestamp as one of the primary key fields in a Hive external table stored on S3
Load the same set of records with mode("append") and option('hoodie.datasource.write.operation', 'upsert')
Check for duplicates in the data
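
The duplicate check in the last step can be sketched outside Spark, for example on _hoodie_record_key values collected from the table. This is only an illustration: normalize_key and find_logical_duplicates are hypothetical helpers, and the sketch assumes 'name:value' pairs separated by commas, no commas inside values, timestamp strings with a fractional part, and UTC.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def normalize_key(record_key, ts_field):
    """Return the key as a tuple with ts_field rewritten as epoch microseconds."""
    parts = []
    for pair in record_key.split(","):
        # Split on the first colon only, since timestamp strings contain colons.
        name, _, value = pair.partition(":")
        if name == ts_field and not value.isdigit():
            parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumed time zone
            value = str((parsed - EPOCH) // timedelta(microseconds=1))
        parts.append((name, value))
    return tuple(parts)


def find_logical_duplicates(record_keys, ts_field):
    """Group raw keys by normalized form and keep groups with more than one row."""
    groups = defaultdict(list)
    for key in record_keys:
        groups[normalize_key(key, ts_field)].append(key)
    return [raw for raw in groups.values() if len(raw) > 1]


keys = [
    "claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:2020-11-21 00:00:00.0,"
    "plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999",
    "claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:1605916800000000,"
    "plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999",
]
duplicates = find_logical_duplicates(keys, "claim_subm_dt")
print(len(duplicates))  # 1 group containing both raw keys
```

Grouping on the normalized key rather than the raw string is what surfaces the duplicates, since Hudi treats the two raw strings as different records.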

Expected behavior

No duplicates in the data. The timestamp field in the record key should keep the same representation across bulk_insert and upsert and should not be converted to a long.

Environment Description

Hudi version : 0.7.0 installed in EMR 5.33
Spark version : 2.4.7
Hive version : 2.3.7
Hadoop version : Amazon 2.10.1
Storage (HDFS/S3/GCS…) : s3
Running on Docker? (yes/no) : No

Issue Analytics

State:
Created 2 years ago
Comments:18 (12 by maintainers)

Top GitHub Comments

mkk1490 commented, Jul 30, 2021

@nsivabalan I set the row writer property (row.writer.enable) to false and ingested the data. Now the timestamp is converted to its epoch value, with a long datatype, in the _hoodie_record_key.

This actually solves my issue since during upsert, the key would be in sync with the IDL key. But bulk_insert with row.writer:False is very slow. It actually takes double the time for the same data ingestion.
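
A minimal sketch of the option sets this workaround implies, reusing option names from the original report. base_options is a hypothetical stand-in for the full hudi_options dictionary, and the snippet only builds dictionaries; it does not call Spark or Hudi.

```python
# Hypothetical subset of the options from the report; extend as needed.
base_options = {
    "hoodie.table.name": "f_claim_mdcl_hudi_cow",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
}

# Initial load: bulk_insert with the row writer disabled, as described in the comment.
bulk_insert_options = {
    **base_options,
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.row.writer.enable": "false",
}

# Later batches: upsert, which produced the epoch form of the timestamp in this report.
upsert_options = {
    **base_options,
    "hoodie.datasource.write.operation": "upsert",
}
```

Keeping the shared settings in one dictionary makes it harder for the two write paths to drift apart in anything other than the operation and the row writer flag.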

abhishekshenoy commented, Feb 28, 2022

@nsivabalan I see the issue is closed, but in 0.10.1 I still face the duplicate issue when I provide a timestamp column as part of the composite key.

    hoodiConfigs.put("hoodie.insert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.upsert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.bulkinsert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.delete.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.datasource.write.row.writer.enable", "true")
    hoodiConfigs.put("hoodie.table.keygenerator.class", classOf[ComplexKeyGenerator].getName)
    hoodiConfigs.put("hoodie.datasource.write.keygenerator.class", classOf[ComplexKeyGenerator].getName)
    hoodiConfigs.put("hoodie.datasource.write.recordkey.field", "transactionId,storeNbr,transactionTs")
    hoodiConfigs.put("hoodie.datasource.write.precombine.field", "messageMetadata.srcLoadTs")
    hoodiConfigs.put("hoodie.table.precombine.field", "messageMetadata.srcLoadTs")
    hoodiConfigs.put("hoodie.datasource.write.partitionpath.field", "transactionDt")
    hoodiConfigs.put("hoodie.datasource.write.payload.class",classOf[DefaultHoodieRecordPayload].getName)
    hoodiConfigs.put("hoodie.datasource.write.hive_style_partitioning", "true")
    hoodiConfigs.put("hoodie.datasource.write.table.type",COW_TABLE_TYPE_OPT_VAL)
    hoodiConfigs.put("hoodie.combine.before.upsert","true")
    hoodiConfigs.put("hoodie.table.name","huditransaction")
    hoodiConfigs.put("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled","true")

With BULK_INSERT_OPERATION_OPT_VAL

The output dataset after the first insert is not deduplicated within the batch, even with combine before insert set to true. See key (abc, 4162, 2022-02-25 05:08:10.073):

+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key                                                   |_hoodie_partition_path  |_hoodie_file_name                                                      |transactionId|storeNbr|transactionTs          |messageMetadata                                   |prefixes  |dummyInt|dummyLong|dummyObjects    |transactionDt|
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+
|20220228210614823  |20220228210614823_0_1|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|abc          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_2|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|abc          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-26 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_3|transactionId:bcd,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|bcd          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_4|transactionId:cde,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|cde          |4163    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_5|transactionId:def,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|def          |4163    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+

Republishing the same data with UPSERT_OPERATION_OPT_VAL results in duplicates. In addition, the data in transactionTs and messageMetadata.srcLoadTs for the records loaded using BULK_INSERT_OPERATION_OPT_VAL has changed.

In the record key, the transactionTs value is an epoch timestamp (1645745890073000) for records loaded using UPSERT_OPERATION_OPT_VAL and a formatted timestamp string (2022-02-25 05:08:10.073) for records loaded using BULK_INSERT_OPERATION_OPT_VAL.
With UPSERT_OPERATION_OPT_VAL, we see combine before insert working correctly.
For the records loaded using BULK_INSERT_OPERATION_OPT_VAL, the columns transactionTs and messageMetadata.srcLoadTs have had their values changed to timestamps on 1970-01-20 (transactionTs is now 1970-01-20 06:39:05.890073).

+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key                                                   |_hoodie_partition_path  |_hoodie_file_name                                                       |transactionId|storeNbr|transactionTs             |messageMetadata                                      |prefixes  |dummyInt|dummyLong|dummyObjects    |transactionDt|
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
|20220228210614823  |20220228210614823_0_1|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |abc          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_2|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |abc          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:40:32.35, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_3|transactionId:bcd,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |bcd          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_4|transactionId:cde,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |cde          |4163    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_5|transactionId:def,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |def          |4163    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_1|transactionId:bcd,storeNbr:4162,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|bcd          |4162    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_2|transactionId:cde,storeNbr:4163,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|cde          |4163    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_3|transactionId:def,storeNbr:4163,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|def          |4163    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_4|transactionId:abc,storeNbr:4162,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|abc          |4162    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-26 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
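
The 1970-01-20 values in the table are consistent with an epoch value in milliseconds being read back as microseconds. The check below is plain arithmetic, not Hudi code, and it does not establish where such a misreading would happen. The UTC+05:30 display offset is an assumption inferred from the values shown, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DISPLAY_TZ = timezone(timedelta(hours=5, minutes=30))  # assumed session time zone


def millis_read_as_micros(displayed: str) -> str:
    """Compute the epoch milliseconds of a displayed local timestamp,
    reinterpret that number as microseconds, and format the result."""
    local = datetime.strptime(displayed, "%Y-%m-%d %H:%M:%S.%f")
    local = local.replace(tzinfo=DISPLAY_TZ)
    millis = (local - EPOCH) // timedelta(milliseconds=1)
    misread = EPOCH + timedelta(microseconds=millis)
    return misread.astimezone(DISPLAY_TZ).strftime("%Y-%m-%d %H:%M:%S.%f")


# transactionTs as first written, then as shown after the upsert
print(millis_read_as_micros("2022-02-25 05:08:10.073"))  # 1970-01-20 06:39:05.890073

# messageMetadata.srcLoadTs for the second 'abc' row
print(millis_read_as_micros("2022-02-26 05:09:10.0"))    # 1970-01-20 06:40:32.350000
```

Both results line up with the changed values in the rows from commit 20220228210614823, which is why a unit mismatch looks like a plausible explanation.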
