
[SUPPORT] CoW: Hudi Upsert not working when there is a timestamp field in the composite key

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A batch process applies updates to existing tables in the data lake. These are Hive external partitioned tables whose location points to an S3 directory. I'm working on a PoC to migrate all the tables to Hudi. I did a bulk_insert for the IDL and everything went fine. For upserts, I have a problem: my primary key combination has a timestamp field in it. I have added all the required config in my code, but data is getting duplicated because the timestamp field is rendered differently in the recordkey.field during the upsert operation. Below are my Hudi options:

    hudi_options = {
        'hoodie.table.name': 'f_claim_mdcl_hudi_cow',
        'hoodie.datasource.write.recordkey.field': 'claim_id,pat_id,claim_subm_dt,plac_of_srvc_cd,src_pri_psbr_id,src_plan_id',
        'hoodie.datasource.write.partitionpath.field': 'src_sys_nm,yr_mth',
        'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
        'hoodie.datasource.write.table.name': 'f_hudi_cow',
        # 'hoodie.combine.before.insert': 'false',
        'hoodie.combine.before.upsert': 'true',
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.table': 'f_hudi_cow',
        'hoodie.datasource.hive_sync.partition_fields': 'src_sys_nm,yr_mth',
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.hive_sync.database': 'us_commercial_datalake_app_commons_dev',
        'hoodie.datasource.hive_sync.support_timestamp': 'true',
        'hoodie.datasource.hive_sync.auto_create_db': 'false',
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
        'hoodie.datasource.write.row.writer.enable': 'true',
        'hoodie.parquet.small.file.limit': '600000000',
        'hoodie.parquet.max.file.size': '1000000000',
        'hoodie.upsert.shuffle.parallelism': '10000',
        'hoodie.insert.shuffle.parallelism': '10000',
        'hoodie.clean.automatic': 'false',
        'hoodie.cleaner.commits.retained': 3,
        'hoodie.index.type': 'GLOBAL_SIMPLE',
        'hoodie.simple.index.update.partition.path': 'true',
        'hoodie.metadata.enable': 'true'
    }

    df.write.format("org.apache.hudi") \
        .options(**hudi_options) \
        .option('hoodie.datasource.write.operation', 'upsert') \
        .mode("append") \
        .save(s3_path)

I don't get any errors while processing. My record key for the bulk insert looks like this:

After bulk_insert: _hoodie_record_key: claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:2020-11-21 00:00:00.0,plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999

After upsert, another key was added for the same record: _hoodie_record_key: claim_id:10420217599403398158,pat_id:8607357348,claim_subm_dt:1605916800000000,plac_of_srvc_cd:INPATIENT HOSPITAL,src_pri_psbr_id:7605954,src_plan_id:0009659999
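
The long value in the upsert-generated key appears to be the same timestamp rendered as epoch microseconds; decoding it (a sketch, not part of the original report) gives back the bulk_insert date:

    from datetime import datetime, timezone

    micros = 1605916800000000  # value from the upsert-generated record key
    print(datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc))
    # 2020-11-21 00:00:00+00:00 -- the same date the bulk_insert key shows as
    # 'claim_subm_dt:2020-11-21 00:00:00.0'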

To Reproduce

Steps to reproduce the behavior:

  1. Generate a set of records with a timestamp as one of the primary key fields in a Hive external table stored on S3, and bulk_insert them into a Hudi table
  2. Load the same set of records with mode("append") and option('hoodie.datasource.write.operation', 'upsert')
  3. Check for duplicates in the data (a sketch of such a check follows this list)
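
For step 3, the check can look something like this (a sketch, not from the original issue; spark and s3_path are assumed to be in scope, and the key columns are the ones from the config above):

    # Count rows sharing the same business key; anything with count > 1 is a duplicate.
    # (Older Hudi versions may need partition globs on the load path, e.g. s3_path + "/*/*".)
    key_cols = ["claim_id", "pat_id", "claim_subm_dt", "plac_of_srvc_cd",
                "src_pri_psbr_id", "src_plan_id"]
    dupes = (spark.read.format("org.apache.hudi").load(s3_path)
             .groupBy(*key_cols).count()
             .filter("count > 1"))
    dupes.show(truncate=False)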

Expected behavior

No duplicates in the data. The recordkey.field value should remain the same for the timestamp field and not get converted to a long (epoch) value.

Environment Description

  • Hudi version : 0.7.0 installed in EMR 5.33

  • Spark version : 2.4.7

  • Hive version : 2.3.7

  • Hadoop version : Amazon 2.10.1

  • Storage (HDFS/S3/GCS…) : s3

  • Running on Docker? (yes/no) : No

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (12 by maintainers)

Top GitHub Comments

1 reaction
mkk1490 commented, Jul 30, 2021

@nsivabalan I set the row_writer property to false and ingested the data. Now timestamps get converted to their respective epoch seconds (long datatype) in the hoodie key.

This actually solves my issue, since during upsert the key stays in sync with the IDL key. But bulk_insert with row.writer set to false is very slow; it takes about double the time for the same data ingestion.
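
For reference, that workaround amounts to flipping the row-writer flag before the bulk_insert (a sketch against the hudi_options dict from the issue description, not the commenter's exact code):

    # Disable the row-writer path so bulk_insert and upsert build the record key
    # the same way, at the cost of a slower bulk_insert.
    hudi_options['hoodie.datasource.write.row.writer.enable'] = 'false'
    df.write.format("org.apache.hudi") \
        .options(**hudi_options) \
        .option('hoodie.datasource.write.operation', 'bulk_insert') \
        .mode("append") \
        .save(s3_path)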

0 reactions
abhishekshenoy commented, Feb 28, 2022

@nsivabalan I see the issue is closed, but in 0.10.1 I still face the duplicate issue when I provide a timestamp column as part of the composite key.

    hoodiConfigs.put("hoodie.insert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.upsert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.bulkinsert.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.delete.shuffle.parallelism", "1")
    hoodiConfigs.put("hoodie.datasource.write.row.writer.enable", "true")
    hoodiConfigs.put("hoodie.table.keygenerator.class", classOf[ComplexKeyGenerator].getName)
    hoodiConfigs.put("hoodie.datasource.write.keygenerator.class", classOf[ComplexKeyGenerator].getName)
    hoodiConfigs.put("hoodie.datasource.write.recordkey.field", "transactionId,storeNbr,transactionTs")
    hoodiConfigs.put("hoodie.datasource.write.precombine.field", "messageMetadata.srcLoadTs")
    hoodiConfigs.put("hoodie.table.precombine.field", "messageMetadata.srcLoadTs")
    hoodiConfigs.put("hoodie.datasource.write.partitionpath.field", "transactionDt")
    hoodiConfigs.put("hoodie.datasource.write.payload.class",classOf[DefaultHoodieRecordPayload].getName)
    hoodiConfigs.put("hoodie.datasource.write.hive_style_partitioning", "true")
    hoodiConfigs.put("hoodie.datasource.write.table.type",COW_TABLE_TYPE_OPT_VAL)
    hoodiConfigs.put("hoodie.combine.before.upsert","true")
    hoodiConfigs.put("hoodie.table.name","huditransaction")
    hoodiConfigs.put("hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled","true")

With BULK_INSERT_OPERATION_OPT_VAL

The output dataset after the first insert does not dedupe records within the batch, even with combine-before-insert set to true. Key: (abc, 4162, 2022-02-25 05:08:10.073)

+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key                                                   |_hoodie_partition_path  |_hoodie_file_name                                                      |transactionId|storeNbr|transactionTs          |messageMetadata                                   |prefixes  |dummyInt|dummyLong|dummyObjects    |transactionDt|
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+
|20220228210614823  |20220228210614823_0_1|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|abc          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_2|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|abc          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-26 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_3|transactionId:bcd,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|bcd          |4162    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_4|transactionId:cde,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|cde          |4163    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_5|transactionId:def,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet|def          |4163    |2022-02-25 05:08:10.073|{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+-----------------------------------------------------------------------+-------------+--------+-----------------------+--------------------------------------------------+----------+--------+---------+----------------+-------------+

Republishing the same data with UPSERT_OPERATION_OPT_VAL results in duplicates, and the values of transactionTs and messageMetadata.srcLoadTs for the records originally loaded with BULK_INSERT_OPERATION_OPT_VAL have changed.

  1. If you look at the record key field, the transactionTs value is an epoch (long) value for records written with UPSERT_OPERATION_OPT_VAL, but a formatted timestamp for records written with BULK_INSERT_OPERATION_OPT_VAL.
  2. With UPSERT_OPERATION_OPT_VAL, combine-before-insert works correctly.
  3. The columns transactionTs and messageMetadata.srcLoadTs have their values changed to 1970-01-20 06:39:05.890073 (see the note after the table below).
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key                                                   |_hoodie_partition_path  |_hoodie_file_name                                                       |transactionId|storeNbr|transactionTs             |messageMetadata                                      |prefixes  |dummyInt|dummyLong|dummyObjects    |transactionDt|
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
|20220228210614823  |20220228210614823_0_1|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |abc          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_2|transactionId:abc,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |abc          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:40:32.35, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_3|transactionId:bcd,storeNbr:4162,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |bcd          |4162    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_4|transactionId:cde,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |cde          |4163    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210614823  |20220228210614823_0_5|transactionId:def,storeNbr:4163,transactionTs:2022-02-25 05:08:10.073|transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-10-0_20220228210614823.parquet |def          |4163    |1970-01-20 06:39:05.890073|{key, value, 1, 2, 1970-01-20 06:39:05.95, 1, { -> }}|[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_1|transactionId:bcd,storeNbr:4162,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|bcd          |4162    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_2|transactionId:cde,storeNbr:4163,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|cde          |4163    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_3|transactionId:def,storeNbr:4163,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|def          |4163    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-25 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
|20220228210729355  |20220228210729355_0_4|transactionId:abc,storeNbr:4162,transactionTs:1645745890073000       |transactionDt=2022-02-25|d572dc96-ed78-46ae-8560-430d82456941-0_0-26-20_20220228210729355.parquet|abc          |4162    |2022-02-25 05:08:10.073   |{key, value, 1, 2, 2022-02-26 05:09:10, 1, { -> }}   |[abc, def]|1       |1        |[{a, 1}, {a, 1}]|2022-02-25   |
+-------------------+---------------------+---------------------------------------------------------------------+------------------------+------------------------------------------------------------------------+-------------+--------+--------------------------+-----------------------------------------------------+----------+--------+---------+----------------+-------------+
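
As an aside, the 1970-01-20 values are consistent with an epoch-milliseconds number being reinterpreted as microseconds somewhere in the write/read path. A minimal sketch of that arithmetic, assuming a session timezone of IST (UTC+05:30), which matches the printed values:

    from datetime import datetime, timedelta, timezone

    ist = timezone(timedelta(hours=5, minutes=30))  # assumed session timezone
    ms = 1645745890073                              # 2022-02-25 05:08:10.073 IST as epoch milliseconds

    # Interpreted correctly as milliseconds:
    print(datetime.fromtimestamp(ms / 1000, tz=ist))        # 2022-02-25 05:08:10.073000+05:30
    # Reinterpreted as microseconds, the same number collapses to January 1970:
    print(datetime.fromtimestamp(ms / 1_000_000, tz=ist))   # 1970-01-20 06:39:05.890073+05:30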
