
[SUPPORT] S3 file listing causing compaction to get eventually slow

See original GitHub issue

We are running incremental updates to our MoR table on S3, with updates every 10 minutes and compaction every 10 commits (roughly every 1.5 hours). We have noticed that if we want to keep history for longer than a few hours (cleaner set to clean after 50 commits), then compaction time starts increasing as the number of files in S3 grows. The chart below shows the time taken to upsert incremental changes to the table; spikes indicate the commits where inline compaction got triggered.

[Chart: upsert time per commit, with spikes at the commits where inline compaction triggered]
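For reference, here is a minimal sketch of writer options matching that cadence (the option keys are standard Hudi configs; the table name and path are placeholders, and this is an illustration rather than our exact job code):

```scala
// Sketch of Hudi writer options matching the setup above (MoR table,
// inline compaction every 10 delta commits, cleaner retaining 50 commits).
// `df` is the incremental batch; table name and path are placeholders.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "10")
  .option("hoodie.cleaner.commits.retained", "50")
  .mode("append")
  .save("s3://bucket/table")
```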
When looking into the logs, we noticed that the majority of the time is spent recursively listing all files in the table's S3 folder. More specifically, the logs contain the following lines:

```
20/07/15 13:58:19 INFO HoodieMergeOnReadTableCompactor: Compacting s3://bucket/table with commit 20200715135819
20/07/15 14:36:04 INFO HoodieMergeOnReadTableCompactor: Compaction looking for files to compact in [0, 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 4, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 5, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 6, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 7, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 8, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 9, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99] partitions
```

The code that gets executed between those two log lines is here: https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/compact/HoodieMergeOnReadTableCompactor.java#L181-L194
I added log statements around various parts of that code to measure time and was able to narrow it down to this function:
https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java#L225
As a matter of fact, of a compaction that took 50+ minutes, 38 of those minutes were spent executing that function, which appears to mostly recursively list files under the S3 table location.
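To illustrate why this dominates, here is a simplified sketch of what such a recursive listing amounts to (plain Hadoop FileSystem API, not the actual FSUtils code): every directory costs a separate S3 LIST call, so total time scales with the table's file count rather than with the size of the incremental update.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Simplified sketch of a full recursive listing of the table base path.
// Each directory triggers its own listStatus call (an S3 LIST request),
// so cost grows with partition and file counts, not with the update size.
def listRecursively(fs: FileSystem, path: Path): Seq[FileStatus] =
  fs.listStatus(path).toSeq.flatMap { status =>
    if (status.isDirectory) listRecursively(fs, status.getPath)
    else Seq(status)
  }

val basePath = new Path("s3://bucket/table")          // placeholder path
val fs = basePath.getFileSystem(new Configuration())
val allFiles = listRecursively(fs, basePath)          // ~20k files in our case
```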
This issue is observed on all tables, but it is most noticeable on tables where incremental updates touch a large share of partitions (around 50% of all partitions).

Some table stats:
100 partitions, initial size 100 GB, initial file count 6k. We observed 50+ minute compactions after the table grew to 300 GB and 20k files.

Environment Description

  • Hudi version : master branch

  • Spark version : 2.4.4

  • Hive version : 2.3.6

  • Hadoop version : 2.8.5

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
vinothchandar commented, Jul 18, 2020

This means that we need to move this listing logic out of hudi-common if we want to parallelize it with the Spark context.

I will be landing a PR over the weekend that avoids listings for rollbacks… consequently, I have already moved the code you are changing into hudi-client, so it should be simple to redo on top of that.

Overall, we already have a StorageSchemes class that does different things for S3/GCS vs. HDFS/Ignite based on append support… As a more elegant fix, I feel we could take a pass at the listing usages and do different forms of listing based on the storage scheme…
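As a rough sketch of what that parallelized listing could look like (hypothetical code, not the actual PR; it assumes the partition paths are already known from the table's partition metadata):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Hypothetical sketch: list each partition on an executor instead of
// walking the whole table from the driver, so S3 LIST calls run in parallel.
def parallelList(sc: SparkContext, basePath: String, partitions: Seq[String]): Array[String] =
  sc.parallelize(partitions, math.min(partitions.size, 100))
    .flatMap { partition =>
      // FileSystem and Configuration are not serializable; build them per task.
      val path = new Path(basePath, partition)
      val fs = path.getFileSystem(new Configuration())
      fs.listStatus(path).map(_.getPath.toString)
    }
    .collect()
```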

0 reactions
vinothchandar commented, Jun 5, 2021

The metadata table has been out for a couple of releases now.
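The metadata table keeps file listings in an internal index, so compaction planning no longer lists S3 directly. A minimal sketch of turning it on (`hoodie.metadata.enable` is the actual config key; it was opt-in when introduced and is enabled by default in newer releases):

```scala
// Sketch: enable the Hudi metadata table so file listings are served
// from an internal index instead of recursive S3 LIST calls.
df.write.format("hudi")
  .option("hoodie.metadata.enable", "true")
  // ...same table/compaction options as before...
  .mode("append")
  .save("s3://bucket/table")
```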

Read more comments on GitHub >

Top Results From Across the Web

Troubleshoot slow or inconsistent speeds when downloading ...
When I download from or upload to Amazon S3 from a specific network or machine, my requests might get higher latency. How can...

Dealing with Small Files Issues on S3: A Guide to Compaction
Small Files Create Too Much Latency For Data Analytics · Compaction – Turning Many Small Files into Fewer Large Files to Reduce Query...

Best practices: Delta Lake | Databricks on AWS
Best practices: Delta Lake · Provide data location hints · Compact files · Replace the content or schema of a table · Spark...

Exploring ETL Options for Compaction on AWS - Data By Dan
Spark is notoriously slow in processing a large number of small files because S3 is not a true filesystem leading to list operations...

High Frequency Small Files vs. Slow Moving Datasets - Dremio
First, if the compaction process was unable to keep up, queries on the data lake would suffer due to expensive file listings.
