[SUPPORT] Avoid UPSERT unchanged records from source

Problem

When the source data set has unchanged rows, Hudi will upsert the target table rows and include those records in the new commit. With CDC/incremental logic, the incoming data may contain records identical to a previous insert, new records, and changed records. Hudi upserts all of them (new, changed, and unchanged), and they all become part of a new commit.

Now when you want to query increments, the result will include a lot of unnecessary (unchanged) rows as well. I would like to avoid that. Is there a way to drop unchanged rows from the source?

To Reproduce

Steps to reproduce the behavior:

  1. Fully load Hudi table

Target example:

---------------------------------------------------------------------
|     row_key    |     att_1      |      att_2     |    commit      |
---------------------------------------------------------------------
|        1       |      1_1       |       1_2      |        0       |
---------------------------------------------------------------------
|        2       |      2_1       |       2_2      |        0       |
---------------------------------------------------------------------
  2. Incrementally upsert a new data set (the incremental data set should include unchanged records)

Incremental data:

----------------------------------------------------
|     row_key    |     att_1      |      att_2     |  
----------------------------------------------------
|        1       |      1_1       |       1_2      |
----------------------------------------------------
|        2       |      2_1       |    changed     |
----------------------------------------------------
|        3       |      3_1       |       3_2      |
----------------------------------------------------
|        4       |      4_1       |       4_2      |
----------------------------------------------------
  3. Incrementally query the Hudi table for the latest commit

Target example:

---------------------------------------------------------------------
|     row_key    |     att_1      |      att_2     |    commit      |
---------------------------------------------------------------------
|        1       |      1_1       |       1_2      |        1       |
---------------------------------------------------------------------
|        2       |      2_1       |    changed     |        1       |
---------------------------------------------------------------------
|        3       |      3_1       |       3_2      |        1       |
---------------------------------------------------------------------
|        4       |      4_1       |       4_2      |        1       |
---------------------------------------------------------------------

Expected behavior

Target example:

---------------------------------------------------------------------
|     row_key    |     att_1      |      att_2     |    commit      |
---------------------------------------------------------------------
|        1       |      1_1       |       1_2      |        0       |
---------------------------------------------------------------------
|        2       |      2_1       |    changed     |        1       |
---------------------------------------------------------------------
|        3       |      3_1       |       3_2      |        1       |
---------------------------------------------------------------------
|        4       |      4_1       |       4_2      |        1       |
---------------------------------------------------------------------
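
One way to arrive at the expected result above is to filter the incoming batch before the upsert, keeping only rows whose key is new or whose attributes differ from the current target row. The following is a minimal plain-Python sketch of that idea using the example data from this issue. It is not Hudi API: `drop_unchanged` and the list-of-dicts tables are hypothetical names used only for illustration, and in Spark the same comparison would more likely be expressed as a join between the incoming data and a snapshot of the target.

```python
def drop_unchanged(target, incoming, key="row_key"):
    """Return only incoming rows that are new or differ from the target row."""
    current = {row[key]: row for row in target}
    changed = []
    for row in incoming:
        existing = current.get(row[key])
        # Compare only the columns present in the source row, so target-only
        # metadata such as the commit column is ignored.
        if existing is None or any(existing.get(col) != val for col, val in row.items()):
            changed.append(row)
    return changed


# Example data taken from the tables in this issue.
target = [
    {"row_key": 1, "att_1": "1_1", "att_2": "1_2", "commit": 0},
    {"row_key": 2, "att_1": "2_1", "att_2": "2_2", "commit": 0},
]
incoming = [
    {"row_key": 1, "att_1": "1_1", "att_2": "1_2"},
    {"row_key": 2, "att_1": "2_1", "att_2": "changed"},
    {"row_key": 3, "att_1": "3_1", "att_2": "3_2"},
    {"row_key": 4, "att_1": "4_1", "att_2": "4_2"},
]

to_upsert = drop_unchanged(target, incoming)
print([row["row_key"] for row in to_upsert])  # [2, 3, 4]
```

Only rows 2, 3 and 4 would then be passed to the upsert, so row 1 would keep its original commit. The trade-off is that the current target state has to be read for the affected keys before every write.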

Environment Description

  • Hudi version : 0.5.3
  • Spark version : 2.4.5
  • Storage (HDFS/S3/GCS…) : S3

Thank you in advance!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

2 reactions
sleapfish commented, Feb 3, 2021

@nsivabalan That is correct! Maybe that is possible if you have control over the source, but let’s say that you extract data from it with a rolling window of -3 days. Some of the records from those 3 days could change, but most of them wouldn’t. I want to commit only changed/new records to the target Hudi table.

When I do an incremental query, I don’t want all 3 days’ worth of data in it, even though only a small portion of it actually changed.
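
For a rolling-window extract like the one described here, one option that avoids reading the target table is to keep a fingerprint per record key from earlier runs and forward only rows whose fingerprint changed. The sketch below is plain Python and makes several assumptions: `fingerprint`, `changed_since_last_run`, and the in-memory `seen` dict are hypothetical, and a real pipeline would need to persist the fingerprints somewhere durable between runs.

```python
import hashlib
import json


def fingerprint(row, key="row_key"):
    """Stable hash of all non-key columns of a row."""
    payload = {k: v for k, v in row.items() if k != key}
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def changed_since_last_run(window_rows, seen, key="row_key"):
    """Return rows whose fingerprint differs from the one stored in `seen`.

    `seen` maps record key -> last fingerprint and is updated in place.
    """
    changed = []
    for row in window_rows:
        fp = fingerprint(row, key)
        if seen.get(row[key]) != fp:
            seen[row[key]] = fp
            changed.append(row)
    return changed


seen = {}  # assumption: stands in for a persisted key -> fingerprint store
run_1 = [{"row_key": 1, "att_1": "a"}, {"row_key": 2, "att_1": "b"}]
run_2 = [
    {"row_key": 1, "att_1": "a"},   # unchanged, re-extracted by the window
    {"row_key": 2, "att_1": "b2"},  # changed
    {"row_key": 3, "att_1": "c"},   # new
]

first = changed_since_last_run(run_1, seen)
second = changed_since_last_run(run_2, seen)
print([r["row_key"] for r in second])  # [2, 3]
```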

0 reactions
victorcadena commented, Dec 3, 2022

One question: if the persisted record has the same ordering value as the incoming record, the record is updated, right?

https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java#L139

Is there an option to change that behavior? For me, if the ordering is the same it would be more useful to leave the old value as I don’t want to waste resources merging something unchanged. Is that a possibility?

@nsivabalan
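
The tie-breaking choice being asked about can be illustrated with a small sketch. This is plain Python, not Hudi code: `pick_record`, the `ts` ordering field, and the `keep_old_on_tie` flag are hypothetical, and the default of letting the incoming record win on a tie simply mirrors the assumption stated in the question above rather than confirmed library behavior.

```python
def pick_record(persisted, incoming, ordering_field="ts", keep_old_on_tie=False):
    """Choose which record survives a merge, based on an ordering value."""
    old_value = persisted[ordering_field]
    new_value = incoming[ordering_field]
    if new_value > old_value:
        return incoming
    if new_value == old_value:
        # Tie: either rewrite with the incoming record or keep the stored one.
        return persisted if keep_old_on_tie else incoming
    return persisted


stored = {"row_key": 1, "att_1": "old", "ts": 10}
same_ts = {"row_key": 1, "att_1": "new", "ts": 10}

print(pick_record(stored, same_ts)["att_1"])                        # new
print(pick_record(stored, same_ts, keep_old_on_tie=True)["att_1"])  # old
```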
