[SUPPORT] Hudi Write Performance
Hello,
I want to start using Hudi in my data lake, so I am running some performance tests comparing current processing time with and without Hudi. We have a lot of tables in our data lake, so we process them in groups within the same Spark context, using different threads. I ran a test that reprocesses all source tables: with regular parquet it took 15 minutes, while with Hudi bulk insert it took 29 minutes. Hudi performs some operations that a plain parquet write does not (sorting, for example), but the big performance difference was in the parquet write stage itself. Is there any difference between writing parquet through Hudi and writing regular parquet? I used the gzip codec in both cases.
For Hudi I set the bulk insert parallelism to 20, and for regular parquet I used coalesce(20).
Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 executors with 5 cores each and 35g of memory
spark submit:
spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=3000 --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.memory=35g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar --class TableProcessorWrapper s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01
val hudiOptions = Map[String, String](
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.bulkinsert.shuffle.parallelism" -> "20",
  "hoodie.parquet.small.file.limit" -> "536870912",
  "hoodie.parquet.max.file.size" -> "1073741824",
  "hoodie.parquet.block.size" -> "536870912",
  "hoodie.copyonwrite.record.size.estimate" -> "1024",
  "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
  "hoodie.datasource.write.recordkey.field" -> primaryKey.mkString(","),
  "hoodie.datasource.write.keygenerator.class" -> (
    if (primaryKey.size == 1) "org.apache.hudi.keygen.SimpleKeyGenerator"
    else "org.apache.hudi.keygen.ComplexKeyGenerator"
  ),
  "hoodie.datasource.write.partitionpath.field" -> partitionColumn,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.table.name" -> tableName,
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.database" -> databaseName,
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.partition_fields" -> partitionColumn,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"
)
(Spark UI screenshots: the regular parquet write; Hudi's RDD conversion part; the Hudi write stage, which took double the time.)
This was one real-world job I tried, but I notice this slow writing in every job where I use Hudi.
Is this normal? Is there any way to tune it? Am I doing something wrong?
Thank you so much!!!
Issue Analytics
- Created 3 years ago
- Comments:6 (3 by maintainers)
Hello,
I changed the option hoodie.datasource.write.row.writer.enable and the job took only 21 minutes, about 30% faster, great!!!
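For anyone following along, enabling the row writer is just one more entry in the options map. A sketch, reusing the hudiOptions map from the question (Hudi's Row-based bulk insert path skips the DataFrame-to-RDD conversion, which is what made the write stage faster here):

```scala
// Enable the Row-based writer for bulk_insert; all other options unchanged.
val rowWriterOptions = hudiOptions +
  ("hoodie.datasource.write.row.writer.enable" -> "true")
```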
Yes, that's correct; they are lexicographically sorted, if you notice.
This is a trick we used at Uber even before Hudi. It lays the data out initially sorted, so range pruning is faster, and when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest number of files in total. Otherwise, if you hash-partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach, you end up with at most 2000 files. Huge benefit. And from there on, when doing upserts/inserts, Hudi will maintain the file sizes.
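The file-count arithmetic above can be sketched as a toy calculation (not Hudi code; the variable names are illustrative):

```scala
val numPartitionPaths  = 1000
val shuffleParallelism = 1000

// Hash partitioning: each shuffle task may receive records for every
// partition path, so in the worst case it writes one file per path.
val hashedFilesWorstCase = shuffleParallelism * numPartitionPaths // 1,000,000

// Sorting by partition path: each task writes a contiguous range of paths,
// so the total is bounded by one file per task plus one per partition path.
val sortedFilesUpperBound = shuffleParallelism + numPartitionPaths // 2000
```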