[SUPPORT] Upsert data with an identical record key and pre-combine field
I am using the AWS DMS change data capture (CDC) service to get change data from my database, and then Apache Hudi in an AWS Glue ETL job to process the change data and create a table in Hive. As the pre-combine field I use a timestamp sent by AWS DMS that records when the data was committed (update_ts_dms).
I have a few use cases where inserts, updates, and deletes for the same primary key carry the same timestamp from DMS. After change data processing, Hudi does not keep the latest updated row for that primary key in the table; instead it keeps an apparently arbitrary insert or update, presumably because both the primary key and the pre-combine field are identical.
Is there any suggested solution for such a case?
Sample Data:
{ "Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
{ "Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
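To illustrate why identical pre-combine values are a problem, here is a minimal plain-Python sketch of pre-combine-style deduplication (an assumption-laden simplification of Hudi's latest-wins payload behavior, not Hudi's actual code): when the timestamps tie, the comparison has no basis to prefer one record, so whichever record happens to be seen first survives.

```python
# Sketch (assumption): a simplified model of "keep the record with the
# larger pre-combine value" deduplication. Field names come from the
# sample data above.
records = [
    {"Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
    {"Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
]

def precombine(a, b):
    # Keep the record with the larger pre-combine value; on a tie the
    # choice is arbitrary (here the already-seen record wins).
    return a if a["update_ts_dms"] >= b["update_ts_dms"] else b

merged = {}
for rec in records:
    key = rec["id"]
    merged[key] = precombine(merged[key], rec) if key in merged else rec

# With identical timestamps the Insert survives only because it arrived
# first; reverse the input order and the Delete would win instead.
surviving_op = merged[10125412]["Op"]
```

Reversing the input list flips the outcome, which matches the "random insert or update" behavior described above.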
Environment Description
- Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar
- Spark version: 2.4
- Hadoop version: 2.8
- Storage: S3
- Running on Docker?: no
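For context, a sketch of how the pre-combine field is typically wired up in a Glue/Spark job writing to Hudi. The option keys are standard Hudi datasource options; the table name and save path are hypothetical placeholders, and this configuration reproduces the setup described above rather than fixing the tie-breaking issue.

```python
# Hudi datasource write options matching the setup described in this
# issue. "dms_cdc_table" and the S3 path below are hypothetical.
hudi_options = {
    "hoodie.table.name": "dms_cdc_table",  # hypothetical name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "update_ts_dms",
    "hoodie.datasource.write.operation": "upsert",
}

# In the Glue job this dict would be passed to the Spark writer, e.g.:
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://bucket/path/dms_cdc_table")
```

Because `update_ts_dms` is the pre-combine field, two CDC records with the same `id` and the same `update_ts_dms` give Hudi no deterministic way to order them.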
Issue Analytics
- Created: 2 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
See this PR: https://github.com/apache/hudi/pull/3267
Thank you @danny0405. I have tested and confirmed that the issue I described above has been resolved for me.