[SUPPORT] Hudi Write Performance
Hello,
I want to start using Hudi in my data lake, so I am running some performance tests comparing current processing time with and without Hudi. We have a lot of tables in our data lake, so we process them in groups within the same Spark context, using different threads. I ran a test that reprocesses all source tables: with regular parquet it took 15 minutes, while with Hudi bulk insert it took 29 minutes. Hudi performs some operations that a plain parquet write does not (sorting, for example), but the big performance difference was in the parquet write stage itself. Is there any difference between writing parquet through Hudi and writing regular parquet? I used the gzip codec in both cases.
For Hudi I set the bulk insert parallelism to 20, and for regular parquet I used coalesce(20).
Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 executors with 5 cores each and 35g of memory
spark submit:
spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=3000 --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.memory=35g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar --class TableProcessorWrapper s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01
val hudiOptions = Map[String, String](
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "bulk_insert",
  "hoodie.bulkinsert.shuffle.parallelism" -> "20",
  "hoodie.parquet.small.file.limit" -> "536870912",
  "hoodie.parquet.max.file.size" -> "1073741824",
  "hoodie.parquet.block.size" -> "536870912",
  "hoodie.copyonwrite.record.size.estimate" -> "1024",
  "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
  "hoodie.datasource.write.recordkey.field" -> primaryKey.mkString(","),
  "hoodie.datasource.write.keygenerator.class" -> (
    if (primaryKey.size == 1) "org.apache.hudi.keygen.SimpleKeyGenerator"
    else "org.apache.hudi.keygen.ComplexKeyGenerator"
  ),
  "hoodie.datasource.write.partitionpath.field" -> partitionColumn,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.table.name" -> tableName,
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.database" -> databaseName,
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.partition_fields" -> partitionColumn,
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"
)
(Spark UI screenshots: the regular parquet write; Hudi's RDD conversion part; the Hudi write stage, which took double the time.)
This was one real-world job I tried, but I notice this slow writing in every job where I use Hudi.
Is this normal? Is there any way to tune it? Am I doing something wrong?
Thank you so much!!!
Issue Analytics
- Created 3 years ago
- Comments:6 (3 by maintainers)
Hello,
I changed the option hoodie.datasource.write.row.writer.enable and the job took only 21 minutes, about 30% faster, great!!!
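For anyone following along, enabling the row writer is just one more entry in the options map. A sketch, reusing the hudiOptions map from the question (Hudi's Row-based bulk insert path skips the DataFrame-to-RDD conversion, which is what made the write stage faster here):

```scala
// Enable the Row-based writer for bulk_insert; all other options unchanged.
val rowWriterOptions = hudiOptions +
  ("hoodie.datasource.write.row.writer.enable" -> "true")
```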
Yes, that's correct; they are lexicographically sorted, if you notice.
This is a trick we used at Uber even before Hudi. It lays the data out initially sorted, so range pruning is faster, and when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest number of files in total. Otherwise, if you hash-partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach, you end up with at most 2000 files. Huge benefit. And from there on, when doing upserts/inserts, Hudi will maintain the file sizes.
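The file-count arithmetic above can be sketched as a toy calculation (not Hudi code; the variable names are illustrative):

```scala
val numPartitionPaths  = 1000
val shuffleParallelism = 1000

// Hash partitioning: each shuffle task may receive records for every
// partition path, so in the worst case it writes one file per path.
val hashedFilesWorstCase = shuffleParallelism * numPartitionPaths // 1,000,000

// Sorting by partition path: each task writes a contiguous range of paths,
// so the total is bounded by one file per task plus one per partition path.
val sortedFilesUpperBound = shuffleParallelism + numPartitionPaths // 2000
```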