
How to shorten the merge time of multi-partitioned tables

See original GitHub issue

Hi team, I use Flink to write data from Kafka to Iceberg. Tables are partitioned by date and event, e.g. datekey='20220526' and event='xxxx'. The event partition produces more than 300 partitions per day. Because streaming writes generate a lot of small files (roughly one Parquet file per partition almost every minute), I tried the official Spark action for data compaction, as shown below:

SparkActions
        .get()
        .rewriteDataFiles(table)
        .filter(Expressions.equal("datekey", dateKey))
        .filter(Expressions.equal("event", event))
        .option("target-file-size-bytes", Long.toString(128 * 1024 * 1024))
        .execute();


Unfortunately, it was too slow: merging one day's data took almost half a day. When I tried a multi-threaded parallel merge, the metadata commit failed with this error: "Cannot commit: stale table metadata".

My questions are as follows:

1. How can the data-merge time of a multi-partition table be shortened?
2. How should the three actions rewriteDataFiles, expireSnapshots, and deleteOrphanFiles be coordinated? I don't know when to execute them or in what order.
3. Will Iceberg support automatic merging of small files in the future? If so, it would save a lot of extra work.

Can someone answer my doubts?

P.S. My Iceberg version is 0.13.1.

Best regards,

Cqz

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

3 reactions
kbendick commented, May 26, 2022

Lastly, all of that frequent committing that is generating small files would likely be easier to manage if you also ran the RewriteManifests action. This will rewrite the Iceberg metadata manifests, packing them into larger files. This will reduce the time spent on planning, as likely your metadata has also gotten larger.
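A minimal sketch of that manifest rewrite, using the same Actions API as the rewriteDataFiles snippet above (API shape as of Iceberg 0.13.x; `table` is assumed to be the same Table handle already loaded from your catalog):

```java
// Sketch: compact many small manifest files into larger ones to cut
// down scan-planning time. Run this alongside data-file compaction;
// it only rewrites metadata, not data files.
SparkActions
        .get()
        .rewriteManifests(table)
        .execute();
```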

I’ll leave you with the link to the Spark SQL stored procedure, https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_manifests, but I would similarly encourage you to check out the code for further options and understanding of how it works given the scale of your table.

Lastly, I usually advocate for writing the data out as "correctly" as possible the first time. In this case, I don't think that's entirely possible (and table maintenance is a fact of life with Iceberg), but if you reduce your Flink job's commit frequency even slightly, you'll likely see a significant speedup. You should also try setting the write distribution mode.

If your Flink job shuffles data to the same task manager for similar output, you’ll potentially wind up with fewer files.

I’d suggest starting from https://iceberg.apache.org/docs/latest/configuration/#write-properties (looking at write.distribution.mode), as well as looking for a larger summary in the relevant file for that configuration value, https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/DistributionMode.java.

The PR for adding write.distribution.mode support to Flink was added by openinx in https://github.com/apache/iceberg/commit/c75ac359c1de6bf9fd4894b40009c5c42d2fee9d, which might also be of interest to you to see the Flink specific behavior and any caveats. By shuffling the data for each partition to one task manager on write (or a handful of them potentially), then you’ll see fewer smaller files with fewer writers for each partition.
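As a concrete illustration, the distribution mode can be set as a table property (property name per the configuration page linked above; "hash" is one of the documented values, chosen here only as an example):

```java
// Sketch: set write.distribution-mode so writers shuffle rows for the
// same partition to the same task, producing fewer, larger files.
// This is a metadata-only change committed to the table.
table.updateProperties()
        .set("write.distribution-mode", "hash")
        .commit();
```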

But I think the biggest thing that would help you, given that your rewrite tasks each take about a minute, is to parallelize (as you tried), but via partial-progress.enabled and max-concurrent-file-group-rewrites. This will likely leave the job with the same total CPU time, but will reduce its wall-clock run time significantly. That, and using write.distribution.mode to produce fewer small files to start with (as called out in the javadoc comment for the DistributionMode enum). Also, committing from Flink every 2 minutes, or even every minute and a half, instead of every minute will likely help a good bit, and for most use cases an extra minute of latency really isn't much.
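Putting that together, a sketch of the parallelized rewrite (option names as discussed above; the concurrency value here is illustrative, not a tuned recommendation):

```java
// Sketch: parallel compaction with partial progress, so finished file
// groups commit incrementally instead of in one large, conflict-prone
// commit at the end. `table` and `dateKey` are as in the snippet above.
SparkActions
        .get()
        .rewriteDataFiles(table)
        .filter(Expressions.equal("datekey", dateKey))
        .option("partial-progress.enabled", "true")
        .option("max-concurrent-file-group-rewrites", "8") // illustrative value
        .option("target-file-size-bytes", Long.toString(128 * 1024 * 1024))
        .execute();
```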

Best of luck and please let me know if / when we can close this issue!

2 reactions
RussellSpitzer commented, May 26, 2022

Agree with @kbendick. Boosting max-concurrent-file-group-rewrites alone should give you much better performance. It defaults to 1, which is fine if you have a few very large partitions with lots of big data files, but bad if you have tons of very small partitions with very small data files.
