[SUPPORT] Avoid UPSERT unchanged records from source
Problem
When the source data set contains unchanged rows, Hudi upserts the matching target rows and includes those records in the new commit. With CDC/incremental logic, the incoming data may contain records identical to a previous insert alongside new and changed records. Hudi upserts all of them (new, changed, and unchanged), and they all become part of the new commit.
Now when you query increments, the result includes a lot of unnecessary (unchanged) rows as well. I would like to avoid that. Is there a way to drop unchanged rows from the source?
To Reproduce
Steps to reproduce the behavior:
- Fully load Hudi table
Target example:
---------------------------------------------------------------------
| row_key | att_1 | att_2 | commit |
---------------------------------------------------------------------
| 1 | 1_1 | 1_2 | 0 |
---------------------------------------------------------------------
| 2 | 2_1 | 2_2 | 0 |
---------------------------------------------------------------------
- Incrementally upsert a new data set (the incremental data set should include unchanged records)
Incremental data:
----------------------------------------------------
| row_key | att_1 | att_2 |
----------------------------------------------------
| 1 | 1_1 | 1_2 |
----------------------------------------------------
| 2 | 2_1 | changed |
----------------------------------------------------
| 3 | 3_1 | 3_2 |
----------------------------------------------------
| 4 | 4_1 | 4_2 |
----------------------------------------------------
- Incrementally query Hudi table for the latest commit
Target example:
---------------------------------------------------------------------
| row_key | att_1 | att_2 | commit |
---------------------------------------------------------------------
| 1 | 1_1 | 1_2 | 1 |
---------------------------------------------------------------------
| 2 | 2_1 | changed | 1 |
---------------------------------------------------------------------
| 3 | 3_1 | 3_2 | 1 |
---------------------------------------------------------------------
| 4 | 4_1 | 4_2 | 1 |
---------------------------------------------------------------------
Expected behavior
Target example:
---------------------------------------------------------------------
| row_key | att_1 | att_2 | commit |
---------------------------------------------------------------------
| 1 | 1_1 | 1_2 | 0 |
---------------------------------------------------------------------
| 2 | 2_1 | changed | 1 |
---------------------------------------------------------------------
| 3 | 3_1 | 3_2 | 1 |
---------------------------------------------------------------------
| 4 | 4_1 | 4_2 | 1 |
---------------------------------------------------------------------
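The expected behavior above amounts to comparing each incoming row with the current target row for the same key and upserting only the rows that are new or different. Below is a minimal sketch of that filtering step in plain Python, using the example data from this issue. It does not call Hudi or Spark; the helper name `drop_unchanged` is hypothetical, and the target rows are assumed to be compared without the commit column. In Spark, the same idea could be expressed as a left anti join of the incoming data against the current table snapshot on all data columns, at the cost of reading the target before each write.

```python
# Pure-Python model of the pre-upsert filter; no Hudi or Spark calls are used.
# `drop_unchanged` is a hypothetical helper name, not part of any library.

def drop_unchanged(target_rows, incoming_rows, key="row_key"):
    """Return only the incoming rows that are new or differ from the target."""
    # Index the current target rows by record key for constant-time lookup.
    current = {row[key]: row for row in target_rows}
    # Keep a row if its key is unknown (new) or its attributes differ (changed).
    return [row for row in incoming_rows if current.get(row[key]) != row]


# Target table after the full load (commit column left out of the comparison).
target = [
    {"row_key": 1, "att_1": "1_1", "att_2": "1_2"},
    {"row_key": 2, "att_1": "2_1", "att_2": "2_2"},
]

# Incremental data set, including one unchanged record (row_key 1).
incoming = [
    {"row_key": 1, "att_1": "1_1", "att_2": "1_2"},
    {"row_key": 2, "att_1": "2_1", "att_2": "changed"},
    {"row_key": 3, "att_1": "3_1", "att_2": "3_2"},
    {"row_key": 4, "att_1": "4_1", "att_2": "4_2"},
]

to_upsert = drop_unchanged(target, incoming)
print([row["row_key"] for row in to_upsert])  # prints [2, 3, 4]
```

Only the rows in `to_upsert` would then be written, so row 1 keeps commit 0 and the incremental query returns rows 2, 3, and 4.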
Environment Description
- Hudi version : 0.5.3
- Spark version : 2.4.5
- Storage (HDFS/S3/GCS…) : S3
Thank you in advance!
Issue Analytics
- State:
- Created 3 years ago
- Reactions:2
- Comments:15 (8 by maintainers)
Top GitHub Comments
@nsivabalan That is correct! But even if you have control over the source, let's say that you extract data from it with a rolling window of -3 days. There is a case where some of the records from those 3 days could change, but most of them wouldn't. I want to commit only changed/new records to the target Hudi table.
When I do an incremental query, I don't want all 3 days' worth of data in it, even though only a small portion of it actually changed.
One question: if the persisted record has the same ordering value as the incoming record, the record is updated, right?
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java#L139
Is there an option to change that behavior? For me, if the ordering value is the same, it would be more useful to keep the old value, as I don't want to waste resources merging something unchanged. Is that a possibility?
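To make the question concrete, here is a small plain-Python model of the two tie-breaking rules being compared. It is not Hudi code: the function `merge`, the flag `keep_on_tie`, and the ordering field name `ts` are all hypothetical, and the default branch simply assumes that an incoming record with an equal ordering value replaces the persisted one, as described above.

```python
# Plain-Python model of the tie-breaking question; this is not Hudi's API.
# `merge`, `keep_on_tie`, and the field name `ts` are hypothetical.

def merge(persisted, incoming, ordering_field="ts", keep_on_tie=False):
    """Pick which record version survives a merge."""
    if persisted is None:
        # Nothing stored yet for this key, so the incoming record is an insert.
        return incoming
    old, new = persisted[ordering_field], incoming[ordering_field]
    if keep_on_tie:
        # Requested behavior: replace only when the incoming value is strictly newer.
        return incoming if new > old else persisted
    # Assumed current behavior: an equal ordering value still replaces the record.
    return incoming if new >= old else persisted


stored = {"row_key": 1, "att_1": "1_1", "ts": 100, "source": "stored"}
resent = {"row_key": 1, "att_1": "1_1", "ts": 100, "source": "resent"}

print(merge(stored, resent)["source"])                    # prints resent
print(merge(stored, resent, keep_on_tie=True)["source"])  # prints stored
```

With the strict comparison, a record that is re-sent with the same ordering value leaves the stored version in place.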
@nsivabalan