Slow Write into Hudi Dataset (MOR)
Hi Team,
I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming. The pipeline structure is:
Kafka(Source) > Spark Structured Streaming(EMR) > MOR Hudi table(S3)
Spark 2.4.5, Hudi 0.5.2
I am seeing performance issues while writing data into the Hudi dataset. The following Hudi jobs are taking a long time:
- countByKey at HoodieBloomIndex.java
- countByKey at WorkloadProfile.java
- count at HoodieSparkSqlWriter.scala
The configuration used to write the Hudi dataset is as follows:
new_df.write.format("org.apache.hudi").option("hoodie.table.name", tableName) \
.option("hoodie.datasource.write.operation", "upsert") \
.option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
.option("hoodie.datasource.write.recordkey.field", "wbn") \
.option("hoodie.datasource.write.partitionpath.field", "ad") \
.option("hoodie.datasource.write.precombine.field", "action_date") \
.option("hoodie.compact.inline", "true") \
.option("hoodie.compact.inline.max.delta.commits", "300") \
.option("hoodie.datasource.hive_sync.enable", "true") \
.option("hoodie.upsert.shuffle.parallelism", "5") \
.option("hoodie.insert.shuffle.parallelism", "5") \
.option("hoodie.bulkinsert.shuffle.parallelism", "5") \
.option("hoodie.datasource.hive_sync.table", tableName) \
.option("hoodie.datasource.hive_sync.partition_fields", "ad") \
.option("hoodie.index.type","GLOBAL_BLOOM") \
.option("hoodie.bloom.index.update.partition.path", "true") \
.option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
.option("hoodie.datasource.hive_sync.partition_extractor_class",
"org.apache.hudi.hive.MultiPartKeysValueExtractor") \
.mode("append").save(tablePath)
Spark submit command:
spark-submit --deploy-mode client --master yarn --executor-memory 6g --executor-cores 1 --driver-memory 4g --conf spark.driver.maxResultSize=2g --conf spark.executor.id=driver --conf spark.executor.instances=300 --conf spark.kryoserializer.buffer.max=512m --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.task.cpus=1 --conf spark.yarn.driver.memoryOverhead=1024 --conf spark.yarn.executor.memoryOverhead=3072 --conf spark.yarn.max.executor.failures=100 --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --py-files s3://spark-test/hudi_job.py
Attaching screenshots of the job details:
countByKey at HoodieBloomIndex.java
countByKey at WorkloadProfile.java
count at HoodieSparkSqlWriter.scala
Please suggest how I can tune this.
Thanks Raghvendra
Top GitHub Comments
@vinothchandar I ran the job with a 5-minute batch interval using MOR. Now I can see the commit duration is 5 minutes and compaction is also 5 minutes, and updated records are only 10% of the total records written, but the job is now running with a huge lag. Sample commits are as below -
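Compaction scheduling is one lever visible in the original configuration: hoodie.compact.inline is set to true, so the micro-batch that triggers compaction pays its full cost before the next batch can proceed. Below is a minimal sketch of how those options could be grouped and adjusted; the values are purely illustrative, not figures from this thread, and deferring compaction entirely to run it out of band is another possibility depending on the Hudi version.

# Illustrative values only: bound how often inline compaction fires, or set
# hoodie.compact.inline to "false" to schedule compaction and run it out of band.
compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "10",  # placeholder, tune per workload
}

# Merged into the writer shown in the original post, e.g.:
# new_df.write.format("org.apache.hudi").options(**compaction_opts) ...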
We have made a lot of improvements around performance in Hudi: https://hudi.apache.org/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks
Can you try out 0.12? Also, we have written a blog about the different indexes in Hudi and when to use which: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/ It might benefit your use case.
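For illustration only, switching away from the global index would be a small change to the writer options in the original post. The sketch below assumes a record key never moves to a different partition; a partition-scoped BLOOM index avoids the table-wide lookup that GLOBAL_BLOOM performs, but it also gives up the cross-partition update handling that hoodie.bloom.index.update.partition.path enables.

# Sketch only: scope index lookups to a record's own partition instead of the whole table.
index_opts = {
    "hoodie.index.type": "BLOOM",
    # GLOBAL_BLOOM-specific setting from the original config; not applicable to a
    # partition-scoped index:
    # "hoodie.bloom.index.update.partition.path": "true",
}

# Applied the same way as the other writer options, e.g.:
# new_df.write.format("org.apache.hudi").options(**index_opts) ...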
Closing this out due to long inactivity. Feel free to reopen or open a new issue if you need assistance.