[SUPPORT] How to deal with hard deletes in one pass

Steps to reproduce the behavior:

  1. From a Spark datasource, read Debezium-like records: each has an Op field indicating whether it is an insert, update, or delete. Assume that for some records the Op value is D (delete).
  2. Upsert the whole dataframe into a Hudi-managed table with the COPY_ON_WRITE storage type. All records with Op = D are soft deleted: they still show up in queries, but all their columns are empty except the Hudi metadata.
  3. A user queries all the data in this table and counts the records.
  4. The count can be wrong because it includes the deleted records. To exclude them, the user has to filter on Op <> D.

Expected behavior

I would expect the user not to be bothered by any metadata columns. So I think we should be able to hard delete in the same pass as the other upsert operations, to offer a coherent view to the end user.

I have already tried to dig a little, asking for information here, for example. My thought is that, philosophically, Hudi lets the user configure operations at the dataframe level (when used with Spark). So the fix might be either a new configuration specifying that deletes must be hard when upserting, or maybe a new kind of operation, “UPSERT_WITH_DELETE”.

Regards,

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
pratyakshsharma commented, May 9, 2022

Going through this blog (https://hudi.apache.org/blog/2020/01/15/delete-support-in-hudi/#deletion-with-datasource), it looks like hard deletes are currently not supported alongside upserts with the Spark datasource. This is a valid ask: introduce a new config to allow hard deletes, at least with the Debezium source. Otherwise, when using HoodieDeltaStreamer as well, it is an overhead to add the _hoodie_is_deleted column for hard deleting. We can probably make the experience smoother with the Debezium source, which already has an op field in its payload. Would like to hear more from @nsivabalan here.

0 reactions
nsivabalan commented, Sep 12, 2022

Closing this out due to long inactivity. A resolution already seems to have been provided. You can check out the latest quick start if need be: https://hudi.apache.org/docs/quick-start-guide#deletes

Feel free to re-open or open a new issue if you need assistance.
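
For readers landing on this thread, here is a minimal sketch of the _hoodie_is_deleted approach mentioned above, written in plain Python so it runs without Spark. It derives the boolean flag from the Op field and builds the write options for a single upsert. The helper names (flag_deletes, upsert_options), the sample table name, and the id/ts columns are hypothetical. Whether the Spark datasource in Hudi 0.9.0 honors the flag during an upsert is an assumption that should be checked against the Hudi documentation for your version.

```python
from typing import Dict, Iterable, List

# Column name and delete marker taken from the issue; adjust to your schema.
OP_FIELD = "Op"
DELETE_OP = "D"


def flag_deletes(records: Iterable[dict]) -> List[dict]:
    """Return copies of the records with a boolean `_hoodie_is_deleted` field.

    Hypothetical helper. On a Spark dataframe the same idea would be a
    derived boolean column computed from the Op column.
    """
    return [
        {**r, "_hoodie_is_deleted": r.get(OP_FIELD) == DELETE_OP}
        for r in records
    ]


def upsert_options(table_name: str, record_key: str, precombine_field: str) -> Dict[str, str]:
    """Build Hudi write options for one upsert pass (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Sample Debezium-like batch (made-up data for illustration only).
batch = [
    {"id": 1, "ts": 10, "Op": "I"},
    {"id": 2, "ts": 11, "Op": "U"},
    {"id": 3, "ts": 12, "Op": "D"},
]

flagged = flag_deletes(batch)
options = upsert_options("customers", "id", "ts")
```

With Spark, the flagged dataframe would then be written once in the hudi format with these options in append mode, so inserts, updates, and deletes travel in the same batch. If the flag is not honored in your version, the fallback is two writes: an upsert of the non-D records followed by a write of the D records with hoodie.datasource.write.operation set to delete.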
