
[SUPPORT] Upgrade from 0.8.0 to 0.10.0 decreases Upsert performance

See original GitHub issue

Describe the problem you faced

Recently, we upgraded our testing environment from Hudi 0.8.0 to Hudi 0.10.0, and after the upgrade we noticed that upsert jobs for some of our existing tables run much slower than they did on Hudi 0.8.0.

For our Hudi tables, we ran one bulk_insert job to ingest the snapshot, then scheduled an upsert job every 10 minutes to ingest incremental updates after the bulk_insert job completed.

To reproduce the issue, we ran an upsert job on a table roughly 1.8 TB in size. The job took in 11 TSV files (< 150 MB in total) containing both new records and updates.

In Hudi 0.8.0 the job took 8.5 minutes to complete, whereas in Hudi 0.10.0 it took 19 minutes. The main difference seemed to come from the “Getting small files from partitions” stage.

(Spark UI screenshots comparing the 0.8.0 and 0.10.0 runs were attached to the original issue.)
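For context on that stage: during upsert planning, Hudi treats existing base files below a size threshold (`hoodie.parquet.small.file.limit`, 100 MB by default) as candidates to pad with incoming records up to `hoodie.parquet.max.file.size`. A minimal pure-Python sketch of that selection idea (the helper and the file listing are hypothetical, not Hudi's actual code):

```python
# Sketch of Hudi-style small-file candidate selection. This is an
# illustration of the concept only, not Hudi's implementation.

SMALL_FILE_LIMIT = 100 * 1024 * 1024  # hoodie.parquet.small.file.limit default

def small_file_candidates(file_sizes):
    """Return files below the small-file limit, i.e. the ones the upsert
    planner would try to grow toward hoodie.parquet.max.file.size."""
    return {name for name, size in file_sizes.items() if size < SMALL_FILE_LIMIT}

# Example partition listing (made-up names and sizes):
sizes = {
    "f1.parquet": 40 * 1024 * 1024,   # 40 MB  -> candidate
    "f2.parquet": 480 * 1024 * 1024,  # 480 MB -> left alone
}
print(small_file_candidates(sizes))   # {'f1.parquet'}
```

The expensive part in a large table is not this filter but listing every partition's files to feed it, which is the work the “Getting small files from partitions” stage performs.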

We also ran the same upsert job against a fresh table with no pre-existing snapshot or incremental data, and in both 0.8.0 and 0.10.0 the job took around 8 minutes to complete.

Based on these results, we speculate that in Hudi 0.10.0 upsert performance degrades as completed upsert jobs accumulate and the table grows, whereas in Hudi 0.8.0 we saw no such degradation.
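Since the slowdown shows up in the file-listing stage and 0.10.0 changed the metadata-table internals, one way to narrow the cause down is to A/B the same incremental batch with the metadata table on and off. `hoodie.metadata.enable` is a real Hudi config; the option-building code below is only a sketch of how such a comparison could be wired up:

```python
# Sketch: two Hudi writer option sets that differ only in whether the
# metadata table (the file-listing path reworked in 0.10.x) is enabled.
# The keys are real Hudi configs; passing them to your Spark writer
# (e.g. via .options(**opts)) is left out of this sketch.

base_opts = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.upsert.shuffle.parallelism": "1500",
}

with_mdt = {**base_opts, "hoodie.metadata.enable": "true"}
without_mdt = {**base_opts, "hoodie.metadata.enable": "false"}

# Run the same 10-minute incremental batch once with each option set and
# compare the "Getting small files from partitions" stage durations.
for name, opts in [("mdt on", with_mdt), ("mdt off", without_mdt)]:
    print(name, opts["hoodie.metadata.enable"])
```

If the run with the metadata table disabled matches the 0.8.0 timing, that would point at the metadata-table listing path rather than the upsert itself.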

Environment Description

  • Hudi version : 0.10.0
  • Spark version : 2.4.7
  • Hive version : 2.3.7
  • Hadoop version : 2.10.1
  • Storage (HDFS/S3/GCS…) : S3
  • Running on Docker? (yes/no) : no
  • AWS EMR: 5.33.0, 1 master node (r6g.16xlarge) with 20 core nodes (r6g.16xlarge)

Additional context

Spark configs:

--deploy-mode cluster
--executor-memory 43g
--driver-memory 43g
--executor-cores 6
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.hadoop.fs.s3.maxRetries=30
--conf spark.yarn.executor.memoryOverhead=5g
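As a sanity check on these settings: an r6g.16xlarge has 64 vCPUs and 512 GiB of memory, so with 6 cores and 43g + 5g overhead per executor, roughly ten executors fit per core node. A back-of-the-envelope sketch (node specs are public AWS figures; YARN/OS reservations are deliberately ignored):

```python
# Rough per-node executor packing for the Spark settings above, assuming
# r6g.16xlarge core nodes (64 vCPUs, 512 GiB) and ignoring the memory
# YARN and the OS reserve for themselves.

node_vcpus, node_mem_gib = 64, 512
executor_cores = 6                  # --executor-cores
executor_mem_gib = 43 + 5           # --executor-memory + memoryOverhead

by_cores = node_vcpus // executor_cores    # limited by vCPUs
by_mem = node_mem_gib // executor_mem_gib  # limited by memory
executors_per_node = min(by_cores, by_mem)
print(executors_per_node)           # 10
```

Both runs were on the same cluster, so this packing does not explain the 0.8.0-vs-0.10.0 gap; it mainly confirms the cluster was not the bottleneck.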

Hudi configs:

hoodie.consistency.check.enabled -> true
hoodie.datasource.write.table.type -> "COPY_ON_WRITE"
hoodie.datasource.write.keygenerator.class -> "org.apache.hudi.keygen.ComplexKeyGenerator"
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.parquet.max.file.size -> 500 * 1024 * 1024
hoodie.datasource.write.operation -> "upsert"
hoodie.metadata.enable -> true
hoodie.metadata.validate -> true (later changed to false)
hoodie.clean.automatic -> true
hoodie.cleaner.commits.retained -> 72
hoodie.keep.min.commits -> 100
hoodie.keep.max.commits -> 150
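For scale, the retention settings above imply a fairly long active timeline given the 10-minute upsert cadence described earlier. A quick arithmetic sketch (the commit interval is from this report; the interpretation of the configs follows Hudi's documented cleaning/archival semantics):

```python
# Back-of-the-envelope timeline math for the cleaning/archival configs above,
# assuming one upsert commit every 10 minutes as described in the report.

commit_interval_min = 10
commits_retained = 72          # hoodie.cleaner.commits.retained
keep_min, keep_max = 100, 150  # hoodie.keep.min.commits / hoodie.keep.max.commits

hours_retained = commits_retained * commit_interval_min / 60
print(hours_retained)          # 12.0 -> cleaning keeps ~12 hours of commits

# Archival trims the active timeline from keep_max back down to keep_min,
# so timeline scans can see up to keep_max instants between archival runs.
print(keep_max)                # 150
```

A timeline of up to 150 active instants is one of the inputs the listing/planning phase has to walk, which is why it grows more expensive as the table ages.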

Please let me know if you need any more information, thanks.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

tjtoll commented, Mar 23, 2022

Good morning,

We are experiencing the same issue with 0.10 and 0.9 (see UI below). We are also using S3, but with AWS Glue rather than EMR. What stands out to me are the 3 consecutive ‘Getting small files from partitions’ stages with 4, 20, and 100 tasks respectively; the stages with 4 and 20 tasks obviously get very poor parallelization. The identical behavior exists in my UI and ChiehFu’s.


tjtoll commented, Sep 14, 2022

@nsivabalan we still see this behavior, but I didn’t see this in Slack. The stages in the Spark UI are clearer in versions after 0.9.0, but we are still using that version.
