Slow Write into Hudi Dataset (MOR)
Hi Team,
I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming. The pipeline structure is:
Kafka(Source) > Spark Structured Streaming(EMR) > MOR Hudi table(S3)
Spark 2.4.5, Hudi 0.5.2
I am seeing performance issues while writing data into the Hudi dataset. The following Hudi jobs are taking a long time:
- countByKey at HoodieBloomIndex.java
- countByKey at WorkloadProfile.java
- count at HoodieSparkSqlWriter.scala
The configuration used to write the Hudi dataset is as follows:
new_df.write.format("org.apache.hudi").option("hoodie.table.name", tableName) \
.option("hoodie.datasource.write.operation", "upsert") \
.option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
.option("hoodie.datasource.write.recordkey.field", "wbn") \
.option("hoodie.datasource.write.partitionpath.field", "ad") \
.option("hoodie.datasource.write.precombine.field", "action_date") \
.option("hoodie.compact.inline", "true") \
.option("hoodie.compact.inline.max.delta.commits", "300") \
.option("hoodie.datasource.hive_sync.enable", "true") \
.option("hoodie.upsert.shuffle.parallelism", "5") \
.option("hoodie.insert.shuffle.parallelism", "5") \
.option("hoodie.bulkinsert.shuffle.parallelism", "5") \
.option("hoodie.datasource.hive_sync.table", tableName) \
.option("hoodie.datasource.hive_sync.partition_fields", "ad") \
.option("hoodie.index.type","GLOBAL_BLOOM") \
.option("hoodie.bloom.index.update.partition.path", "true") \
.option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
.option("hoodie.datasource.hive_sync.partition_extractor_class",
"org.apache.hudi.hive.MultiPartKeysValueExtractor") \
.mode("append").save(tablePath)
Spark submit command:
spark-submit --deploy-mode client --master yarn --executor-memory 6g --executor-cores 1 --driver-memory 4g --conf spark.driver.maxResultSize=2g --conf spark.executor.id=driver --conf spark.executor.instances=300 --conf spark.kryoserializer.buffer.max=512m --conf spark.shuffle.service.enabled=true --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.task.cpus=1 --conf spark.yarn.driver.memoryOverhead=1024 --conf spark.yarn.executor.memoryOverhead=3072 --conf spark.yarn.max.executor.failures=100 --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --py-files s3://spark-test/hudi_job.py
Attaching screenshots of the job details:
countByKey at HoodieBloomIndex.java
countByKey at WorkloadProfile.java
count at HoodieSparkSqlWriter.scala
Please suggest how I can tune this.
Thanks Raghvendra
Top GitHub Comments
@vinothchandar I ran the job with a 5-minute batch interval using MOR. Now I can see the commit duration is 5 minutes and compaction is also 5 minutes, and updated records are only 10% of the total records written, but the job is now running with a huge lag. Sample commits are as below -
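Compaction scheduling is one lever visible in the original configuration: hoodie.compact.inline is set to true, so the micro-batch that triggers compaction pays its full cost before the next batch can proceed. Below is a minimal sketch of how those options could be grouped and adjusted; the values are purely illustrative, not figures from this thread, and deferring compaction entirely to run it out of band is another possibility depending on the Hudi version.

# Illustrative values only: bound how often inline compaction fires, or set
# hoodie.compact.inline to "false" to schedule compaction and run it out of band.
compaction_opts = {
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "10",  # placeholder, tune per workload
}

# Merged into the writer shown in the original post, e.g.:
# new_df.write.format("org.apache.hudi").options(**compaction_opts) ...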
We have made a lot of improvements around performance in Hudi: https://hudi.apache.org/blog/2022/06/29/Apache-Hudi-vs-Delta-Lake-transparent-tpc-ds-lakehouse-performance-benchmarks
Can you try out 0.12? Also, we have written a blog about the different indexes in Hudi and when to use which: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/ It might benefit your use case.
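For illustration only, switching away from the global index would be a small change to the writer options in the original post. The sketch below assumes a record key never moves to a different partition; a partition-scoped BLOOM index avoids the table-wide lookup that GLOBAL_BLOOM performs, but it also gives up the cross-partition update handling that hoodie.bloom.index.update.partition.path enables.

# Sketch only: scope index lookups to a record's own partition instead of the whole table.
index_opts = {
    "hoodie.index.type": "BLOOM",
    # GLOBAL_BLOOM-specific setting from the original config; not applicable to a
    # partition-scoped index:
    # "hoodie.bloom.index.update.partition.path": "true",
}

# Applied the same way as the other writer options, e.g.:
# new_df.write.format("org.apache.hudi").options(**index_opts) ...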
Closing this out due to long inactivity. Feel free to reopen or open a new issue if you need assistance.