[SUPPORT] Slow insert into COW tables with multi level partitions
Hi there,
I have a problem with writing when there are many partitions.
Data
The DataFrame I’m working with is pretty simple: it contains all the messages sent by organizations.
| message_id | timestamp | status | organization_id | year | month | day |
|---|---|---|---|---|---|---|
| bdabfa6f-2a3e-4c17-acd7-350227473ae4 | 2020-11-25T10:00:00Z | SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 11 | 25 |
| 203d5495-9b5d-4003-b7f3-ab312a70db40 | 2020-11-25T11:00:00Z | SENT | 75e498d4-c979-4a12-b8df-1051c7976d34 | 2020 | 11 | 24 |
| 09fa0543-cf5a-4e6b-9d16-ad14a8a7058a | 2020-10-22T09:00:00Z | NOT_SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 10 | 22 |
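For reference, a minimal sketch of how a DataFrame with this shape could be built, deriving the partition columns from the message timestamp (assumes an active `SparkSession` named `spark`; the sample row is illustrative):

```scala
import org.apache.spark.sql.functions.{col, to_timestamp, year, month, dayofmonth}
import spark.implicits._

// Illustrative input row matching the schema above.
val messages = Seq(
  ("bdabfa6f-2a3e-4c17-acd7-350227473ae4", "2020-11-25T10:00:00Z",
   "SENT", "0b38bec3-15ac-4e57-9bb9-48d7de412ffa")
).toDF("message_id", "timestamp", "status", "organization_id")

// Derive year/month/day partition columns from the timestamp string.
val withPartitions = messages
  .withColumn("year",  year(to_timestamp(col("timestamp"))))
  .withColumn("month", month(to_timestamp(col("timestamp"))))
  .withColumn("day",   dayofmonth(to_timestamp(col("timestamp"))))
```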
Previous Scenario (Good performance)
Writing 100k rows took about 40 seconds when the partitioning contained only year, month, and day. But this partitioning scheme was not the best for my use case.
Partitions
year=2020/month=10/day=22
year=2020/month=11/day=24
year=2020/month=11/day=25
Current Scenario (Bad performance)
I changed the partitioning to put organization_id before the other partition fields. With this approach, the writing time increased considerably: it now takes about 5 minutes to write 100k rows. There are hundreds of organizations, and each of them will have a partition per day.
Partitions
organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=10/day=22
organization_id=75e498d4-c979-4a12-b8df-1051c7976d34/year=2020/month=11/day=24
organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=11/day=25
Execution time
Configuration
Both approaches use the same configs. Some settings were based on the Tuning Guide.
Hudi Configs
"hoodie.datasource.write.insert.drop.duplicates" -> "true"
"hoodie.insert.shuffle.parallelism" -> "27"
"hoodie.finalize.write.parallelism" -> "27"
"hoodie.datasource.write.recordkey.field" -> "message_id"
"hoodie.datasource.write.precombine.field" -> "timestamp"
"hoodie.datasource.write.partitionpath.field" -> "organization_id,year,month,day"
"hoodie.datasource.write.keygenerator.class" -> classOf[ComplexKeyGenerator].getName
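For context, the options above could be wired into a Hudi write roughly as follows. This is a sketch, not the original code: `df`, the table name, and the base path are placeholders, and the import path of `ComplexKeyGenerator` varies across Hudi versions (it moved to the `org.apache.hudi.keygen` package in later releases).

```scala
import org.apache.spark.sql.SaveMode

// Sketch: same Hudi options as listed above, applied to a DataFrame write.
// `df` is the partition-ready DataFrame; table name and path are placeholders.
df.write
  .format("hudi")
  .option("hoodie.table.name", "messages") // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "message_id")
  .option("hoodie.datasource.write.precombine.field", "timestamp")
  .option("hoodie.datasource.write.partitionpath.field", "organization_id,year,month,day")
  .option("hoodie.datasource.write.keygenerator.class", classOf[ComplexKeyGenerator].getName)
  .option("hoodie.datasource.write.insert.drop.duplicates", "true")
  .option("hoodie.insert.shuffle.parallelism", "27")
  .option("hoodie.finalize.write.parallelism", "27")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/messages") // placeholder base path
```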
Spark submit configs
...
--driver-memory 4G \
--executor-memory 8G \
--executor-cores 3 \
--num-executors 3 \
...
--conf spark.driver.memoryOverhead=1024 \
--conf spark.executor.memoryOverhead=2048 \
--conf spark.memory.fraction=0.9 \
--conf spark.memory.storageFraction=0.2 \
--conf spark.sql.shuffle.partitions=27 \
--conf spark.default.parallelism=27 \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.rdd.compress=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.executor.extraJavaOptions="-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageHudiStream -Duser.timezone=UTC" \
What’s missing in the configs?
Hudi version: 0.5.2
Spark version: 2.4.5
Issue Analytics
- State:
- Created 3 years ago
- Comments: 9 (4 by maintainers)
Top GitHub Comments
Looks like balaji did beat me to it. 😃
@ygordefraga : This could be coming from the increase in the number of partitions. This could be related to https://github.com/apache/hudi/issues/2269#issuecomment-733299492
Also, note that since you increased the number of partitions with the additional partitioning level, keeping the same number of executors won't be an exactly apples-to-apples comparison.
Can you try 0.6.0 (which has incremental cleaning support) to see if you get better performance? Please note that the next version of Hudi will come with consolidated metadata, which will remove the listing altogether.
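If upgrading to 0.6.0, the cleaner settings most relevant to tables with many partitions can be set alongside the other write options. A hedged sketch (option names are assumptions based on the Hudi configuration docs; verify them against the version in use):

```scala
// Sketch: cleaner options for a many-partition COW table (Hudi 0.6.0+).
// Names and defaults should be checked against the Hudi docs for your version.
  .option("hoodie.cleaner.incremental.mode", "true") // incremental cleaning
  .option("hoodie.cleaner.commits.retained", "10")   // commits kept before cleaning
```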