[SUPPORT] Upsert data with an identical record key and pre-combine field

I am using the AWS DMS change data capture (CDC) service to capture changes from my database, and Apache Hudi in an AWS Glue ETL job to process the change data and create a table in Hive. My pre-combine field is the timestamp that AWS DMS sends to indicate when the data was committed (update_ts_dms).
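
For reference, a write configuration along these lines would match the setup described above. This is only a sketch: the option keys are Hudi datasource write options, but the table name my_table, the choice of id as the record key (inferred from the sample data), and the helper hudi_upsert_options are assumptions, not the settings of the original job.

```python
def hudi_upsert_options(table_name, record_key, precombine_field):
    """Build a dict of Hudi write options for an upsert keyed on record_key."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# "my_table" is a placeholder; "id" and "update_ts_dms" come from the sample data.
options = hudi_upsert_options("my_table", "id", "update_ts_dms")

# In a Spark or Glue job the dict would typically be handed to the writer:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

With this configuration, update_ts_dms is the only value used to order two records that share the same id.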

In a few cases, inserts, updates, and deletes for the same primary key arrive from DMS with the same timestamp. After the change data is processed, Apache Hudi does not keep the latest change for that primary key in the table; instead it keeps an arbitrary insert or update. This may be because the records share both the primary key and the pre-combine value.
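
The ambiguity can be illustrated with a simplified model of pre-combining. This is not Hudi's implementation; the precombine function below is a hypothetical stand-in that keeps, per key, the record with the largest ordering value, and it shows that a tie leaves the outcome dependent on arrival order.

```python
def precombine(records, key_field, ordering_field):
    """Keep one record per key: the one with the largest ordering value.

    On a tie this model keeps the record seen first. The ordering field
    alone carries no information about which tied record is newer.
    """
    latest = {}
    for rec in records:
        key = rec[key_field]
        if key not in latest or rec[ordering_field] > latest[key][ordering_field]:
            latest[key] = rec
    return latest


insert = {"Op": "I", "id": 1, "update_ts_dms": "2021-07-08 10:47:53"}
delete = {"Op": "D", "id": 1, "update_ts_dms": "2021-07-08 10:47:53"}

# Same two records, different arrival order, different surviving record.
first = precombine([insert, delete], "id", "update_ts_dms")[1]["Op"]
second = precombine([delete, insert], "id", "update_ts_dms")[1]["Op"]
```

Here first is "I" and second is "D", even though the input records are identical in both runs.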

Is there a suggested solution for this case?

Sample Data:

{ "Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
{ "Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
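
One way to reason about a tie-breaker for records like these is to order on a composite value rather than on the timestamp alone. The sketch below assumes each change record also carries a monotonically increasing sequence number in a field called seq; that field is hypothetical and does not appear in the sample data, and this is not necessarily the fix adopted in the thread.

```python
# Trimmed copies of the sample records, plus a hypothetical "seq" field
# that reflects the order in which the changes were committed.
records = [
    {"Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "seq": 1},
    {"Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "seq": 2},
]


def ordering_value(rec):
    """Composite ordering: timestamp first, then the sequence number."""
    return (rec["update_ts_dms"], rec["seq"])


def latest_per_key(records, key_field="id"):
    """Keep, per key, the record with the largest composite ordering value."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key_field])
        if current is None or ordering_value(rec) > ordering_value(current):
            latest[rec[key_field]] = rec
    return latest


winner = latest_per_key(records)[10125412]
winner_reversed = latest_per_key(list(reversed(records)))[10125412]
```

Because the sequence number breaks the tie, the delete wins regardless of the order in which the two records arrive.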

Environment Description

  • Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar

  • Spark version : 2.4

  • Hadoop version : 2.8

  • Storage : S3

  • Running on Docker?: no

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

danny0405 commented, Jul 13, 2021
Rap70r commented, Aug 28, 2021

Thank you @danny0405. I have tested and confirmed that the issue I described above has been resolved for me.


Top Results From Across the Web

[SUPPORT] Hudi Upsert but with duplicates record for same key
We use COW table type but after upsert we could see lot of duplicate rows for same record key. We do set the...
All Configurations | Apache Hudi
When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by...
Apache Hudi — The Basics. Features | by Parth Gupta | Medium
Pre-Combine Key. It is used to pick the latest record in case we get multiple records with same primary key. We have used...
New features from Apache Hudi 0.9.0 on Amazon EMR
Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline ...
Apache Hudi Real-time Data Upsert (Update + Insert)
Record_key field is nothing but a 'Primary Key' in relation to the database. Pre-combined fields are used to compare two records based on...
