Hudi upsert hangs
Describe the problem you faced
When we upsert data into Hudi, the job hangs in some cases. Specifically, we have an ETL pipeline that re-ingests a lot of data (i.e. we upsert data that already exists in the Hudi table). When the proportion of data that is not new is very high, the Hudi Spark job seems to hang before writing out the updated table.
Note that this currently affects 2 of the 80 tables in our ETL pipeline and the rest run fine.
To Reproduce
See the gist at: https://gist.github.com/bwu2/89f98e0926374f71c80e4b2fa5089f18
The code there creates a Hudi table with 4m rows. It then upserts another 4m rows, 3.5m of which are the same as the original 4m.
Note that bulk parallelism of the initial load is deliberately set to 1 to ensure we avoid lots of small files.
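For readers who do not open the gist, the two writes described above can be sketched roughly as follows. This is not the gist's code: the table name and the column names (id, ts, part) are hypothetical placeholders, and the helper only builds the option dictionaries so that it runs without a cluster. The option keys are the standard Hudi datasource write configs.

```python
def hudi_write_options(table_name, operation, parallelism=None):
    """Build a Hudi datasource options dict for a single write.

    The column names "id", "ts" and "part" are hypothetical placeholders,
    not the columns used in the gist.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.partitionpath.field": "part",
    }
    if parallelism is not None:
        # Each write operation has its own shuffle-parallelism setting.
        parallelism_key = {
            "bulk_insert": "hoodie.bulkinsert.shuffle.parallelism",
            "insert": "hoodie.insert.shuffle.parallelism",
            "upsert": "hoodie.upsert.shuffle.parallelism",
        }[operation]
        options[parallelism_key] = str(parallelism)
    return options


# Initial load: bulk insert with parallelism 1 to avoid many small files.
initial_load = hudi_write_options("example_table", "bulk_insert", parallelism=1)
# Second write: upsert of the mostly overlapping batch, default parallelism.
upsert = hudi_write_options("example_table", "upsert")

# With a SparkSession and the Hudi bundle on the classpath, the writes would
# look roughly like this (df_initial, df_update and base_path are assumed):
#   df_initial.write.format("org.apache.hudi").options(**initial_load) \
#       .mode("overwrite").save(base_path)
#   df_update.write.format("org.apache.hudi").options(**upsert) \
#       .mode("append").save(base_path)
```

Keeping the options in plain dictionaries makes it easy to vary one setting at a time (for example the upsert parallelism) when trying to narrow down where the job stalls.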
Running this code on an EMR cluster (either interactively in a PySpark shell or via spark-submit) causes the upsert job never to finish. It gets stuck somewhere in the Spark job with the description (from the Spark history server):
count at HoodieSparkSqlWriter.scala:255
(after the stage mapToPair at HoodieWriteClient.java:492 and before/during the stage count at HoodieSparkSqlWriter.scala:255).
For a table this small, the number of cores, memory, executors, and instance type shouldn't matter, but we have varied these too with no success.
Expected behavior
The upsert job should succeed and the total number of rows in the table should be 4.5m.
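The 4.5m figure follows from the numbers in the reproduction: only the incoming rows whose keys are not already in the table add to the count. A minimal sketch of that arithmetic, assuming record keys are unique within each batch (the function name is illustrative, not part of Hudi):

```python
def expected_rows_after_upsert(existing_rows, incoming_rows, overlapping_rows):
    """Row count after an upsert, assuming keys are unique within each batch."""
    if overlapping_rows > min(existing_rows, incoming_rows):
        raise ValueError("overlap cannot exceed the size of either batch")
    # Overlapping keys update existing rows; only the remainder are inserts.
    new_rows = incoming_rows - overlapping_rows
    return existing_rows + new_rows


total = expected_rows_after_upsert(4_000_000, 4_000_000, 3_500_000)
print(total)  # 4500000
```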
Environment Description
Running on EMR 5.29.0
- Hudi version: tested on 0.5.0, 0.5.1 and latest build off master
- Spark version: 2.4.4
- Hive version: N/A
- Hadoop version: 2.8.5 (Amazon)
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): NO
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 3
- Comments: 17 (11 by maintainers)
Top GitHub Comments
Fix landed on master
Reposting my response here…
There seem to be a lot of common concerns here… https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide is a useful resource that will hopefully help here…
A few high-level thoughts:
I would appreciate a JIRA, so that I can break each concern into a sub-task and tackle/resolve them independently…
I am personally focusing on performance now and want to make it a lot faster in the 0.6.0 release, so all this help would be deeply appreciated.