[SUPPORT] Upsert data with an identical record key and pre-combine field
I am using the AWS DMS change data capture (CDC) service to get change data from my database, and then Apache Hudi in an AWS Glue ETL job to process the change data and create a table in Hive. As the pre-combine field I use a timestamp sent by AWS DMS that records when the data was committed (update_ts_dms).
I have a few use cases where inserts, updates, and deletes for the same primary key carry the same timestamp from DMS. After change data processing, Hudi does not keep the latest updated row for that primary key in the table; instead it keeps an apparently arbitrary insert or update, presumably because both the primary key and the pre-combine field are identical.
Is there any suggested solution for such a case?
Sample Data:
{ "Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
{ "Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412, "brand_id": 9722520, "type": "EXPLICIT", "created": "2021-07-08 10:47:53", "updated": "2021-07-08 10:47:53" }
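To illustrate why identical pre-combine values are a problem, here is a minimal plain-Python sketch of pre-combine-style deduplication (an assumption-laden simplification of Hudi's latest-wins payload behavior, not Hudi's actual code): when the timestamps tie, the comparison has no basis to prefer one record, so whichever record happens to be seen first survives.

```python
# Sketch (assumption): a simplified model of "keep the record with the
# larger pre-combine value" deduplication. Field names come from the
# sample data above.
records = [
    {"Op": "I", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
    {"Op": "D", "update_ts_dms": "2021-07-08 10:47:53", "id": 10125412},
]

def precombine(a, b):
    # Keep the record with the larger pre-combine value; on a tie the
    # choice is arbitrary (here the already-seen record wins).
    return a if a["update_ts_dms"] >= b["update_ts_dms"] else b

merged = {}
for rec in records:
    key = rec["id"]
    merged[key] = precombine(merged[key], rec) if key in merged else rec

# With identical timestamps the Insert survives only because it arrived
# first; reverse the input order and the Delete would win instead.
surviving_op = merged[10125412]["Op"]
```

Reversing the input list flips the outcome, which matches the "random insert or update" behavior described above.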
Environment Description
- Hudi version (jar): hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar
- Spark version: 2.4
- Hadoop version: 2.8
- Storage: S3
- Running on Docker?: no
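For context, a sketch of how the pre-combine field is typically wired up in a Glue/Spark job writing to Hudi. The option keys are standard Hudi datasource options; the table name and save path are hypothetical placeholders, and this configuration reproduces the setup described above rather than fixing the tie-breaking issue.

```python
# Hudi datasource write options matching the setup described in this
# issue. "dms_cdc_table" and the S3 path below are hypothetical.
hudi_options = {
    "hoodie.table.name": "dms_cdc_table",  # hypothetical name
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "update_ts_dms",
    "hoodie.datasource.write.operation": "upsert",
}

# In the Glue job this dict would be passed to the Spark writer, e.g.:
# df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://bucket/path/dms_cdc_table")
```

Because `update_ts_dms` is the pre-combine field, two CDC records with the same `id` and the same `update_ts_dms` give Hudi no deterministic way to order them.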
Issue Analytics
- Created: 2 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
See this PR: https://github.com/apache/hudi/pull/3267
Thank you @danny0405. I have tested and confirmed that the issue I described above has been resolved for me.