
[SUPPORT] Slow insert into COW tables with multi level partitions


Hi there,

I have a problem with writing when there are many partitions.

Data

The DataFrame I’m working with is pretty simple. It contains all the messages sent by the organizations.

message_id                           | timestamp            | status   | organization_id                      | year | month | day
bdabfa6f-2a3e-4c17-acd7-350227473ae4 | 2020-11-25T10:00:00Z | SENT     | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 11    | 25
203d5495-9b5d-4003-b7f3-ab312a70db40 | 2020-11-25T11:00:00Z | SENT     | 75e498d4-c979-4a12-b8df-1051c7976d34 | 2020 | 11    | 24
09fa0543-cf5a-4e6b-9d16-ad14a8a7058a | 2020-10-22T09:00:00Z | NOT_SENT | 0b38bec3-15ac-4e57-9bb9-48d7de412ffa | 2020 | 10    | 22
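For context, one plausible way the year/month/day columns are derived from the timestamp is a simple calendar split (a plain-Python sketch, not the poster's actual job; note the second sample row shows the partition day can differ from the timestamp's calendar day, e.g. due to timezone handling):

```python
from datetime import datetime

def partition_columns(timestamp: str):
    """Derive year/month/day partition values from an ISO-8601 UTC timestamp."""
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return ts.year, ts.month, ts.day

# e.g. the first row above
print(partition_columns("2020-11-25T10:00:00Z"))  # (2020, 11, 25)
```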

Previous Scenario (Good performance)

Writing 100k rows took about 40 seconds when the table was partitioned by year, month and day only. But this partitioning scheme was not the best fit for my use case.

Partitions

year=2020/month=10/day=22
year=2020/month=11/day=24
year=2020/month=11/day=25

Current Scenario (Bad performance)

I changed the partitioning to put organization_id before the other partition columns. With this approach the writing time increased considerably: it now takes about 5 minutes to write 100k rows. There are hundreds of organizations, and each of them will have a partition per day.

Partitions


organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=10/day=22
organization_id=75e498d4-c979-4a12-b8df-1051c7976d34/year=2020/month=11/day=24
organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=11/day=25
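The paths above are Hive-style key=value segments; a minimal sketch of how such a path is assembled from the partition fields (illustrative only, not Hudi's actual ComplexKeyGenerator implementation):

```python
def partition_path(row: dict,
                   fields=("organization_id", "year", "month", "day")) -> str:
    """Build a Hive-style partition path (key=value segments joined by '/'),
    similar in shape to what a multi-field key generator produces."""
    return "/".join(f"{f}={row[f]}" for f in fields)

row = {"organization_id": "0b38bec3-15ac-4e57-9bb9-48d7de412ffa",
       "year": 2020, "month": 10, "day": 22}
print(partition_path(row))
# organization_id=0b38bec3-15ac-4e57-9bb9-48d7de412ffa/year=2020/month=10/day=22
```

The slowdown tracks the partition count: with the old layout a year of data is at most 365 partitions, while with hundreds of organizations in front it becomes hundreds of partitions per day, and per-partition file listing and small-file handling grow accordingly.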

Execution time

(Screenshot of Spark execution times, captured 2020-11-26 14:22)

Configuration

Both approaches use the same configs. Some settings were based on the Tuning Guide.

Hudi Configs

"hoodie.datasource.write.insert.drop.duplicates" -> "true"
"hoodie.insert.shuffle.parallelism" -> "27"
"hoodie.finalize.write.parallelism" -> "27"
"hoodie.datasource.write.recordkey.field" -> "message_id"
"hoodie.datasource.write.precombine.field" -> "timestamp"
"hoodie.datasource.write.partitionpath.field" -> "organization_id,year,month,day"
"hoodie.datasource.write.keygenerator.class" -> classOf[ComplexKeyGenerator].getName

Spark submit configs

  ...
  --driver-memory 4G \
  --executor-memory 8G \
  --executor-cores 3 \
  --num-executors 3 \
  ...
  --conf spark.driver.memoryOverhead=1024 \
  --conf spark.executor.memoryOverhead=2048 \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.2 \
  --conf spark.sql.shuffle.partitions=27 \
  --conf spark.default.parallelism=27 \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.rdd.compress=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.extraJavaOptions="-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageHudiStream -Duser.timezone=UTC" \
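The value 27 used for shuffle parallelism lines up with the common rule of thumb of a small multiple of total executor cores (the multiplier of 3 here is an assumption, not from the Tuning Guide verbatim):

```python
def suggested_parallelism(num_executors: int, cores_per_executor: int,
                          tasks_per_core: int = 3) -> int:
    """Rule-of-thumb shuffle/insert parallelism: total cores times a small multiplier."""
    return num_executors * cores_per_executor * tasks_per_core

print(suggested_parallelism(3, 3))  # 27, matching the configs above
```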

What’s missing in the configs?

Hudi version: 0.5.2
Spark version: 2.4.5

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
vinothchandar commented, Nov 30, 2020

Looks like balaji did beat me to it. 😃

1 reaction
bvaradar commented, Nov 30, 2020

@ygordefraga : This could be coming from the increase in the number of partitions; it may be related to https://github.com/apache/hudi/issues/2269#issuecomment-733299492

Also, note that since you increased the number of partitions with the additional partitioning level, keeping the same number of executors won't be an exactly apples-to-apples comparison.

Can you try 0.6.0 (which has incremental cleaning support) to see if you get better performance? Please note that the next version of Hudi will come with consolidated metadata, which will remove the listing altogether.


