[SUPPORT] How to deal with hard deletes in one pass
Steps to reproduce the behavior:
- From a Spark datasource, load Debezium-like records: they have an `Op` field indicating whether the record is an Insert, Update or Delete. Assume that for some records the `Op` value is `D` (delete).
- Upsert the whole dataframe into a Hudi-managed table with the `COPY_ON_WRITE` storage type. All the records with `Op` = `D` are soft deleted (they still come up in queries, but all their columns are empty except the Hudi metadata).
- A user queries all the data of this table and counts the records.
- The count can be wrong, since it includes the delete operations. To avoid that, the user has to filter on `Op` <> `D`.
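To make the counting problem concrete, here is a minimal sketch in plain Python, where a list of dicts stands in for the rows returned by a query on the table. Only the `Op` field comes from the description above; the `id` and `name` columns and both helper functions are hypothetical.

```python
# Debezium-like change records as a reader would see them after the upsert.
# Only `Op` comes from the issue; `id` and `name` are hypothetical columns.
records = [
    {"id": 1, "name": "alice", "Op": "I"},
    {"id": 2, "name": "bob", "Op": "U"},
    {"id": 3, "name": None, "Op": "D"},  # soft-deleted: payload columns are empty
]


def naive_count(rows):
    """Count every row, as a user unaware of the Op column would."""
    return len(rows)


def live_count(rows, op_field="Op"):
    """Count only rows whose operation is not a delete (Op <> 'D')."""
    return sum(1 for row in rows if row.get(op_field) != "D")


print(naive_count(records))  # 3
print(live_count(records))   # 2
```

The gap between the two counts is exactly the number of soft-deleted rows, which is why every reader has to know about the `Op` filter.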
Expected behavior
I would expect the user not to be bothered by any metadata columns. So I think we should be able to hard delete at the same time as the other upsert operations, to offer a coherent view to the end user.
I have already tried to dig a little, asking for information here, for example. My thought is that, philosophically, Hudi lets the user configure operations at the dataframe level (when used with Spark). So it might be either a new configuration specifying that deletes must be hard when upserting, or maybe a new kind of operation, “UPSERT_WITH_DELETE”.
Regards,
Environment Description
- Hudi version : 0.9.0
- Spark version : 3.1.2
- Hive version : 3.1.2
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Issue Analytics
- State:
- Created a year ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
Going through this blog (https://hudi.apache.org/blog/2020/01/15/delete-support-in-hudi/#deletion-with-datasource), it looks like hard deletes are currently not supported together with upserts in the Spark datasource. Introducing a new config to allow hard deletes, at least with a Debezium source, is a valid ask. Otherwise, even when using HoodieDeltaStreamer, adding the _hoodie_is_deleted column just to get hard deletes is an overhead. We can probably make the experience smoother with a Debezium source, which already has an op field in its payload. Would like to hear more from @nsivabalan here.
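For reference, the `_hoodie_is_deleted` approach from the linked blog post could look roughly like this in PySpark. This is a sketch under assumptions, not a confirmed fix for this issue: it assumes the payload class in use honors the `_hoodie_is_deleted` flag as the blog post describes, and the helper names, table name, column names and path are all hypothetical.

```python
# Sketch only: helper names, table name, columns and path are hypothetical.


def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi datasource write options for a COPY_ON_WRITE upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def flag_deletes(df, op_col="Op"):
    """Add a boolean `_hoodie_is_deleted` column that is true where Op == 'D'."""
    # Imported lazily so the options helper can be used without Spark installed.
    from pyspark.sql import functions as F

    # A null Op is treated as "not a delete".
    is_delete = F.coalesce(F.col(op_col) == F.lit("D"), F.lit(False))
    return df.withColumn("_hoodie_is_deleted", is_delete)


def upsert_with_hard_deletes(df, base_path, options):
    """Write inserts, updates and flagged deletes in a single upsert."""
    (
        flag_deletes(df)
        .write.format("hudi")
        .options(**options)
        .mode("append")
        .save(base_path)
    )


# Hypothetical usage, given a DataFrame `changes_df` of Debezium-like records:
opts = hudi_upsert_options("my_table", "id", "ts", "region")
# upsert_with_hard_deletes(changes_df, "s3://my-bucket/hudi/my_table", opts)
```

The extra column is the overhead mentioned above: the writer has to derive it from `Op` on every batch, whereas a dedicated config or operation would let Hudi read `Op` directly.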
Closing this out due to long inactivity. The resolution seems to have been provided already. You can check out the latest quick start guide if need be: https://hudi.apache.org/docs/quick-start-guide#deletes
Feel free to re-open or open a new issue if you need assistance.