
Slow Write into Hudi Dataset(MOR)


Hi Team,

I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming. The pipeline structure is as follows:

Kafka(Source) > Spark Structured Streaming(EMR) > MOR Hudi table(S3)
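[Editor's note] For context, a minimal sketch of what this kind of pipeline typically looks like. This is hedged: the broker address, topic name, checkpoint path, and the write_hudi helper are hypothetical stand-ins, and the actual Hudi write options used are listed in full below.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hudi").getOrCreate()

# Stream records from Kafka (broker and topic are placeholders)
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "shipment-events")
            .load())

# Hudi's Spark DataSource API is batch-oriented, so a common pattern is
# to write each micro-batch from foreachBatch
def write_hudi(batch_df, batch_id):
    batch_df.write.format("org.apache.hudi") \
        .option("hoodie.table.name", "my_table") \
        .mode("append").save("s3://bucket/hudi/my_table")

(kafka_df.writeStream
    .foreachBatch(write_hudi)
    .option("checkpointLocation", "s3://bucket/checkpoints/my_table")
    .start()
    .awaitTermination())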

Spark: 2.4.5, Hudi: 0.5.2

I am seeing performance issues while writing data into the Hudi dataset. The following Hudi jobs are taking the most time:

  • countByKey at HoodieBloomIndex.java
  • countByKey at WorkloadProfile.java
  • count at HoodieSparkSqlWriter.scala

The configuration used to write the Hudi dataset is as follows:

new_df.write.format("org.apache.hudi").option("hoodie.table.name", tableName) \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.datasource.write.recordkey.field", "wbn") \
    .option("hoodie.datasource.write.partitionpath.field", "ad") \
    .option("hoodie.datasource.write.precombine.field", "action_date") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "300") \
    .option("hoodie.datasource.hive_sync.enable", "true") \
    .option("hoodie.upsert.shuffle.parallelism", "5") \
    .option("hoodie.insert.shuffle.parallelism", "5") \
    .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
    .option("hoodie.datasource.hive_sync.table", tableName) \
    .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
    .option("hoodie.index.type","GLOBAL_BLOOM") \
    .option("hoodie.bloom.index.update.partition.path", "true") \
    .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class",
            "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .mode("append").save(tablePath)

Spark submit command:

spark-submit --deploy-mode client --master yarn --executor-memory 6g --executor-cores 1 \
  --driver-memory 4g --conf spark.driver.maxResultSize=2g --conf spark.executor.id=driver \
  --conf spark.executor.instances=300 --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.task.cpus=1 --conf spark.yarn.driver.memoryOverhead=1024 \
  --conf spark.yarn.executor.memoryOverhead=3072 --conf spark.yarn.max.executor.failures=100 \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
  --py-files s3://spark-test/hudi_job.py

Attaching screenshots of the job details below.

[Screenshot: overall job details]

countByKey at HoodieBloomIndex.java [Screenshots: stage summary and task details]

countByKey at WorkloadProfile.java [Screenshots: stage summary and task details]

count at HoodieSparkSqlWriter.scala [Screenshots: stage summary and task details]

Please suggest how I can tune this.

Thanks,
Raghvendra

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 24 (13 by maintainers)

Top GitHub Comments

1 reaction
Raghvendradubey commented, Jul 2, 2020

@vinothchandar I ran the job with a 5-minute batch interval using MOR. Now I can see that commit duration is 5 minutes and compaction is also 5 minutes, and updated records are only ~10% of the total records written, but the job is running with a huge lag. Sample commits are below:

| CommitTime     | Total Bytes Written | Total Files Added | Total Files Updated | Total Partitions Written | Total Records Written | Total Update Records Written | Total Errors |
|----------------|---------------------|-------------------|---------------------|--------------------------|-----------------------|------------------------------|--------------|
| 20200625112117 | 178.0 MB            | 1                 | 3                   | 2                        | 193777                | 18939                        | 0            |
| 20200625111810 | 104.0 MB            | 0                 | 1                   | 1                        | 149946                | 12619                        | 0            |
| 20200625111610 | 211.7 MB            | 0                 | 3                   | 2                        | 259500                | 14721                        | 0            |
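[Editor's note] A hedged reading of the lag arithmetic: with inline compaction enabled, each micro-batch pays both the delta commit (~5 min) and the compaction (~5 min), so against a 5-minute trigger the backlog grows by roughly one batch per trigger. Below is a minimal sketch of taking compaction out of the write path, assuming the other options stay as in the original job; compaction would then be scheduled and run separately, for example with the Hudi CLI's compaction schedule / compaction run commands.

(new_df.write.format("org.apache.hudi")
    .option("hoodie.table.name", tableName)
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    # Disable inline compaction so each micro-batch only pays for the
    # delta commit; compaction runs out of band instead.
    .option("hoodie.compact.inline", "false")
    .mode("append").save(tablePath))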
0 reactions
nsivabalan commented, Sep 12, 2022

We have made a lot of improvements around performance with Hudi: https://hudi.apache.org/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks

Can you try out 0.12? We have also written a blog about the different indexes in Hudi and when to use which: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/ It might benefit your use case.
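[Editor's note] For illustration, one index choice that blog covers is directly relevant here. GLOBAL_BLOOM enforces key uniqueness across all partitions, so the index lookup compares incoming keys against files in every partition, which makes the countByKey stage at HoodieBloomIndex heavier; if record keys are unique within a partition and cross-partition updates are rare, the partition-scoped BLOOM index is cheaper. A minimal sketch, assuming the remaining options stay as in the original job:

(new_df.write.format("org.apache.hudi")
    .option("hoodie.table.name", tableName)
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    # Partition-scoped bloom index: lookups only touch the record's own
    # partition path, unlike GLOBAL_BLOOM which scans all partitions.
    .option("hoodie.index.type", "BLOOM")
    .mode("append").save(tablePath))

Note that hoodie.bloom.index.update.partition.path only applies to the global index types, so it can be dropped when using BLOOM.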

Closing it out due to long inactivity. Feel free to re-open or open a new issue if you need assistance.
