
Hudi upsert hangs

See original GitHub issue

Describe the problem you faced

When we upsert data into Hudi, we’re finding that the job just hangs in some cases. Specifically, we have an ETL pipeline where we re-ingest a lot of data (i.e. we upsert data that already exists in the Hudi table). When the proportion of data that is not new is very high, the Hudi Spark job seems to hang before writing out the updated table.

Note that this currently affects 2 of the 80 tables in our ETL pipeline and the rest run fine.

To Reproduce

See gist at: https://gist.github.com/bwu2/89f98e0926374f71c80e4b2fa5089f18

The code there creates a Hudi table with 4m rows. It then upserts another 4m rows, 3.5m of which are the same as the original 4m.

Note that bulk parallelism of the initial load is deliberately set to 1 to ensure we avoid lots of small files.
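
For orientation, here is a rough PySpark sketch of the shape of that repro (the authoritative code is the gist above; the S3 path, record key, precombine field, partition layout, and exact key overlap below are illustrative placeholders, not copied from the gist):

```python
# Sketch of the repro shape only -- see the linked gist for the real code.
# The S3 path, record key, precombine field, and partition layout here are
# hypothetical placeholders, not taken from the gist.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-upsert-repro").getOrCreate()

base_path = "s3://my-bucket/hudi/repro_table"
hudi_options = {
    "hoodie.table.name": "repro_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Initial load: 4m rows, written with bulk_insert and parallelism deliberately
# set to 1 to avoid creating lots of small files.
initial = (spark.range(0, 4000000)
           .withColumn("ts", F.current_timestamp().cast("long"))
           .withColumn("part", F.lit("0")))
(initial.write.format("org.apache.hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.bulkinsert.shuffle.parallelism", "1")
    .mode("overwrite")
    .save(base_path))

# Upsert: another 4m rows, 3.5m of which share record keys with the initial load.
updates = (spark.range(500000, 4500000)
           .withColumn("ts", F.current_timestamp().cast("long"))
           .withColumn("part", F.lit("0")))
(updates.write.format("org.apache.hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))

# Expected end state: 4.5m rows (glob depth matches the one partition level used here).
print(spark.read.format("org.apache.hudi").load(base_path + "/*/*").count())
```

The longer org.apache.hudi format name is used above because it works across the 0.5.x releases mentioned below; newer releases also accept the shorter hudi alias.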

Running this code on an EMR cluster (either interactively in a PySpark shell or via spark-submit) causes the upsert job to never finish: it gets stuck somewhere in the Spark job whose description (from the Spark history server) is count at HoodieSparkSqlWriter.scala:255, after the stage mapToPair at HoodieWriteClient.java:492 and before/during the stage count at HoodieSparkSqlWriter.scala:255.

For a table this small, cores/memory/executors/instance type shouldn’t matter, but we have varied these as well with no success.

Expected behavior

Expected the upsert job to succeed and the total number of rows in the table to be 4.5m.

Environment Description

Running on EMR 5.29.0

  • Hudi version : tested on 0.5.0, 0.5.1 and latest build off master

  • Spark version : 2.4.4

  • Hive version : N/A

  • Hadoop version : 2.8.5 (Amazon)

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : NO

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 17 (11 by maintainers)

Top GitHub Comments

2 reactions
vinothchandar commented, Mar 3, 2020

Fix landed on master

1 reaction
vinothchandar commented, Feb 13, 2020

Reposting my response here…

There seem to be a lot of common concerns here… https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide is a useful resource that hopefully can benefit here…

A few high-level thoughts:

  • It would be good to lay out whether most of the time is spent on the indexing stages (the ones tagged with HoodieBloomIndex) or on the actual writing…
  • Hudi does keep the input in memory to compute the stats it needs to size files. So if you don’t provide sufficient executor/RDD storage memory, it will spill and can cause slowdowns… (covered in the tuning guide & we have seen this happen with users often)
  • On the workload pattern itself, BloomIndex range pruning can be turned off (https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges) if the key ranges are random anyway… Generally speaking, until we have RFC-8 (record-level indexing), cases of random writes/upserting a majority of the rows in a table may incur bloom index overhead, since the bloom filters/ranges are not at all useful in pruning out files. We have an interim solution coming out in the next release… falling back to a plain old join to implement the indexing.
  • In terms of MOR and COW, MOR will help only if you have lots of updates and the bottleneck is on the writing…
  • If listing is an issue, please turn on the following so the table is listed once and we re-use the filesystem metadata: hoodie.embed.timeline.server=true (a sketch combining these options in code follows this comment)

I would appreciate a JIRA, so that I can break each into sub-task and tackle/resolve independently…

I am personally focusing on performance now and want to make it a lot faster in the 0.6.0 release. So all this help would be deeply appreciated
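
Putting the bloom-index, timeline-server, and parallelism suggestions from this comment into one place, a rough sketch of the corresponding write options might look like the following (the table path, fields, and values are illustrative placeholders, not tuned recommendations; whether each option helps depends on the workload):

```python
# Sketch: the suggestions above expressed as Hudi write options.
# Table path, fields, and parallelism values are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hudi-upsert-tuned").getOrCreate()

base_path = "s3://my-bucket/hudi/repro_table"        # hypothetical table path
updates_df = (spark.range(500000, 4500000)           # same updates as the repro sketch
              .withColumn("ts", F.current_timestamp().cast("long"))
              .withColumn("part", F.lit("0")))

upsert_options = {
    "hoodie.table.name": "repro_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    # Key ranges are effectively random in this workload, so range pruning
    # cannot exclude files and only adds overhead during index lookup.
    "hoodie.bloom.index.prune.by.ranges": "false",
    # List the table once per write and reuse the filesystem metadata.
    "hoodie.embed.timeline.server": "true",
    # Give the upsert stages enough shuffle parallelism for the input size.
    "hoodie.upsert.shuffle.parallelism": "200",
}

(updates_df.write.format("org.apache.hudi")
    .options(**upsert_options)
    .mode("append")
    .save(base_path))
```

As the comment and the tuning guide note, these options only address the indexing and listing overhead; giving executors enough memory for Hudi to cache the input matters just as much.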
