[SUPPORT] Question on hudi's delete statement taking too long
I have a situation where the data is partitioned by “year”, “month” and “day”, and I need to enforce uniqueness across all partitions (a key's data can change from one day to the next). My first attempt was to use a global index, which prevents data duplication but is not scalable: as the amount of data grows, the load time also increases.
So I am using a SIMPLE index and doing the “de-duplication” myself by identifying the rows in older partitions that are also present in the incoming dataset and deleting them. In other words, if I have key 123 in partition 10 and I receive key 123 again for partition 11, I delete the record from partition 10 and insert the one for partition 11.
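A rough sketch of that identification step (the column names “key” and “partitionpath”, the incomingDf variable, and basePath are illustrative assumptions, not taken from the original post):

```scala
// Sketch only: column names ("key", "partitionpath"), incomingDf and basePath
// are illustrative placeholders.
import org.apache.spark.sql.functions.col

// Current table contents (snapshot read of the Hudi table).
val existing = spark.read.format("hudi").load(basePath)

// Keys arriving in the new batch, with the partition they now belong to.
val incomingKeys = incomingDf.select("key", "partitionpath")

// Records already stored under a *different* (older) partition for the same key.
val toDelete = existing.alias("e")
  .join(incomingKeys.alias("i"), col("e.key") === col("i.key"))
  .where(col("e.partitionpath") =!= col("i.partitionpath"))
  .select("e.*")
```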
The delete/insert steps are made with two calls to df.write.format(HUDI_FORMAT)…, the only difference being that the insert uses “upsert” for “hoodie.datasource.write.operation” while the delete uses “delete”.
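Roughly, the two writes look like this (hudiCommonOpts is a placeholder for the shared settings such as table name, record key, partition path and precombine field, and is not from the original post):

```scala
// Sketch of the two writes; hudiCommonOpts is a placeholder Map[String, String]
// holding the shared settings (table name, record key, partition path,
// precombine field, index type).
import org.apache.spark.sql.SaveMode

// Step 1: delete the stale copies found in older partitions.
toDelete.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save(basePath)

// Step 2: upsert the incoming batch into its new partition.
incomingDf.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)
```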
PS: I have also tried an alternative approach where I use the “upsert” write operation with “org.apache.hudi.common.model.EmptyHoodieRecordPayload” as the “hoodie.datasource.write.payload.class” – and the result is exactly the same.
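That alternative would look roughly like the following (again with the placeholder hudiCommonOpts, toDelete and basePath from the sketches above):

```scala
// Same idea as above, but as a single "upsert" whose records carry an empty
// payload so Hudi drops them on merge; hudiCommonOpts/toDelete/basePath are
// placeholders, not from the original post.
toDelete.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.EmptyHoodieRecordPayload")
  .mode(SaveMode.Append)
  .save(basePath)
```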
I am looking for some guidance on why the delete step takes so long. The insert (a few hundred thousand rows) completes in about 30 seconds, while the delete, which affects fewer than 1,000 rows, takes more than 3 minutes. Looking at the Spark logs, I see 3 jobs named “getting small files from partitions” (these appear only for the delete operation) and their traces are basically the same:
org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1557)
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:609)
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:274)
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
The only difference is the number of stages and tasks
Any idea why that is happening and what I can do to speed up the process?

Thanks

Environment Description

- Hudi version : 0.9
- Spark version : 3
- Storage (HDFS/S3/GCS…) : AWS S3
- Running on Docker? (yes/no) : no
Top GitHub Comments
@dmenin : If you are up for issuing two separate operations (delete followed by update), I might have a suggestion. How are your updates/deletes spread in general? Are they spread totally randomly across all partitions and file groups, or do they have an affinity toward a few partitions?
Maybe you can try using the BLOOM index for deletes and see how that goes. If the deletes are not randomly spread out, this will help reduce the index lookup. Also, you can disable small file handling for your delete operation by setting https://hudi.apache.org/docs/configurations/#hoodieparquetsmallfilelimit to 0. A rough sketch of what that would look like follows below.
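For illustration, applying both suggestions to the delete write only (hudiCommonOpts, toDelete and basePath are the same placeholders as in the sketches above):

```scala
// Sketch of the suggestion above, applied only to the delete write:
// BLOOM index plus small-file handling disabled. Placeholders as before.
toDelete.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.index.type", "BLOOM")
  .option("hoodie.parquet.small.file.limit", "0")
  .mode(SaveMode.Append)
  .save(basePath)
```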
Let us know how your MOR exploration is going as well.
Hi, I happened to see your issue. I am also using Apache Hudi in AWS Glue.
I am using MoR and I can query data through Amazon Athena.
I picked MoR over CoW since I want to prevent the Hudi write from spending time rewriting Parquet files. Do you have any reason to pick CoW over MoR?
Thank you Gatsby