
[SUPPORT] Hudi Write Performance

See original GitHub issue

Hello,

I want to start using Hudi on my data lake, so I'm running some performance tests comparing the current processing time with and without Hudi. We have a lot of tables in our data lake, so we process them in groups within the same Spark context using different threads. I ran a test reprocessing all the table sources: with regular Parquet it took 15 minutes, with Hudi bulk insert it took 29 minutes. Hudi has some operations that regular Parquet doesn't, such as sorting, but the big performance difference was in the Parquet write itself. Is there any difference between writing Parquet through Hudi and writing regular Parquet? I used the gzip codec in both.

For Hudi I configured the bulk insert parallelism to 20, and for regular Parquet I did a coalesce(20). The baseline write looks roughly like the sketch below.
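A minimal sketch of that regular-Parquet baseline, for reference only (spark, sourcePath, and outputPath are placeholders for the actual job, which isn't shown in the issue):

// Sketch only: `spark`, `sourcePath`, and `outputPath` stand in for the real
// session and S3 paths used by the job.
val df = spark.read.parquet(sourcePath)
df.coalesce(20)                          // same target parallelism as the Hudi run
  .write
  .mode("overwrite")
  .option("compression", "gzip")         // gzip codec, same as the Hudi run
  .parquet(outputPath)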

  • Hudi version: 0.8.0-SNAPSHOT
  • Spark version: 3.0.1
  • 11 executors with 5 cores each and 35g of memory

spark submit:

spark-submit --deploy-mode cluster \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memoryOverhead=3000 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.executor.memory=35g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --packages org.apache.spark:spark-avro_2.12:2.4.4 \
  --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar \
  --class TableProcessorWrapper \
  s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01

val hudiOptions = Map[String, String](
      "hoodie.table.name"                        -> tableName,
      "hoodie.datasource.write.operation"        -> "bulk_insert",
      "hoodie.bulkinsert.shuffle.parallelism"    -> "20",
      "hoodie.parquet.small.file.limit"          -> "536870912",
      "hoodie.parquet.max.file.size"             -> "1073741824",
      "hoodie.parquet.block.size"                -> "536870912",
      "hoodie.copyonwrite.record.size.estimate"  -> "1024",
      "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
      "hoodie.datasource.write.recordkey.field"  -> primaryKey.mkString(","),
      "hoodie.datasource.write.keygenerator.class" -> (if (primaryKey.size == 1) {
                                                         "org.apache.hudi.keygen.SimpleKeyGenerator"
                                                       } else { "org.apache.hudi.keygen.ComplexKeyGenerator" }),
      "hoodie.datasource.write.partitionpath.field"           -> partitionColumn,
      "hoodie.datasource.write.hive_style_partitioning"       -> "true",
      "hoodie.datasource.write.table.name"                    -> tableName,
      "hoodie.datasource.hive_sync.table"                     -> tableName,
      "hoodie.datasource.hive_sync.database"                  -> databaseName,
      "hoodie.datasource.hive_sync.enable"                    -> "true",
      "hoodie.datasource.hive_sync.partition_fields"          -> partitionColumn,
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.jdbcurl"                   -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"    )

Regular Parquet [screenshot: Spark UI stage timings]

Hudi has an RDD conversion part [screenshot: Spark UI stage timings]

The Hudi write took double the time [screenshots: Spark UI stage timings]

This was one real-world job that I tried, but I notice the slow write on every job where I use Hudi.

Is this normal? Is there any way to tune it? Am I doing something wrong?

Thank you so much!!!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
rubenssoto commented, Jan 25, 2021

Hello,

I changed the option hoodie.datasource.write.row.writer.enable and the job took only 21 minutes, 30% faster. Great!!!
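For anyone reproducing this, the change boils down to adding one entry to the options map from the question (a sketch; tunedHudiOptions is just an illustrative name):

// Sketch only: `hudiOptions` is the map shown in the question.
val tunedHudiOptions = hudiOptions +
  // Row-writer path for bulk_insert: writes Spark Rows straight to Parquet and
  // skips the RDD/Avro record conversion stage seen in the screenshots above.
  ("hoodie.datasource.write.row.writer.enable" -> "true")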

0 reactions
vinothchandar commented, Jan 26, 2021

Yes, that's correct. They are lexicographically sorted, if you notice.

This is a trick we used at Uber even before Hudi. It lays the data out initially sorted, so range pruning is faster, and when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest number of files in total. Otherwise, if you hash-partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach you end up with at most about 2000 files. Huge benefit. And from there on, when doing upserts/inserts, Hudi will maintain the file sizes.
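A back-of-the-envelope sketch of that file-count argument, using the numbers from the comment (illustrative only, not a measurement):

// Illustrative only: the counts come from the comment above.
val shufflePartitions = 1000   // parallelism of the write
val partitionPaths    = 1000   // distinct partition paths in the data

// Hash layout: every write task can receive records for every partition path,
// so the worst case is one file per (task, partition path) pair.
val hashLayoutFiles = shufflePartitions * partitionPaths     // up to 1,000,000 files

// Sorted-by-partition-path layout: each task writes a contiguous range of paths,
// so any single path is split across at most two tasks.
val sortedLayoutFiles = shufflePartitions + partitionPaths   // at most ~2,000 files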


