RepairManifestsAction

A while ago @aokolnychyi suggested a new action, “RepairManifestsAction”, and I wanted to discuss it before starting work on it.

The initial idea is an action that reads data files and uses them to repair Iceberg table manifests. One use case would be repairing manifests after a metadata bug fix, such as #1980 (incorrect file size in the manifest file).

So one proposal is to have a RepairManifestsAction with the typical APIs, such as:

    actions.repairManifests()
        .filter(Expression expr)          // filters the data files that this action will read
        .caseSensitive(boolean value)

It would also have the following options, each producing a different level of Spark job depending on how much of each data file is read:

  • withRepairFileSize(boolean value): reads file sizes from the FileSystem and rewrites manifests
  • withRepairRowCount(boolean value): opens the file and counts rows, then rewrites manifests with the row count
  • withRepairMetrics(boolean value): opens the file and rewrites the manifest with the latest metrics from the file metadata
  • withRemoveDeletedFiles(boolean value): if any data file referenced by a manifest has been removed, removes it from the manifest

The last option, withRemoveDeletedFiles, may also be helpful on its own: if a data file is accidentally removed or corrupted, any query against the Iceberg table that would read that data file fails with “java.io.FileNotFoundException: No such file or directory”. Later on, a ‘repair manifest-list’ or ‘repair metadata’ action may also be useful to handle cases where the manifest files themselves are removed.
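To make the two cheapest options more concrete, here is a minimal sketch of what withRepairFileSize and withRemoveDeletedFiles might do to a manifest. It is only an illustration under stated assumptions: the class RepairManifestsSketch, the Entry type, and the execute method are hypothetical and are not Iceberg classes, and a plain Map stands in for the file system (a path with no entry is treated as a deleted file).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the proposal; none of these types exist in Iceberg. */
class RepairManifestsSketch {

  /** One manifest entry: a data file path and the file size the manifest recorded for it. */
  static final class Entry {
    final String path;
    final long recordedSize;

    Entry(String path, long recordedSize) {
      this.path = path;
      this.recordedSize = recordedSize;
    }
  }

  private boolean repairFileSize = false;
  private boolean removeDeletedFiles = false;

  RepairManifestsSketch withRepairFileSize(boolean value) {
    this.repairFileSize = value;
    return this;
  }

  RepairManifestsSketch withRemoveDeletedFiles(boolean value) {
    this.removeDeletedFiles = value;
    return this;
  }

  /**
   * Returns a repaired copy of the manifest.
   * actualSizes is an assumed stand-in for the file system: path -> real size in bytes.
   * A path that is absent from the map is treated as a deleted data file.
   */
  List<Entry> execute(List<Entry> manifest, Map<String, Long> actualSizes) {
    List<Entry> repaired = new ArrayList<>();
    for (Entry entry : manifest) {
      Long actual = actualSizes.get(entry.path);
      if (actual == null) {
        // Data file is gone: drop the entry only if the caller asked for it.
        if (!removeDeletedFiles) {
          repaired.add(entry);
        }
        continue;
      }
      if (repairFileSize && actual != entry.recordedSize) {
        // Recorded size disagrees with the file system: rewrite the entry.
        repaired.add(new Entry(entry.path, actual));
      } else {
        repaired.add(entry);
      }
    }
    return repaired;
  }

  public static void main(String[] args) {
    List<Entry> manifest = new ArrayList<>();
    manifest.add(new Entry("a.parquet", 200L)); // recorded size is wrong
    manifest.add(new Entry("b.parquet", 50L));  // file no longer exists

    Map<String, Long> fileSystem = new HashMap<>();
    fileSystem.put("a.parquet", 100L);

    List<Entry> repaired = new RepairManifestsSketch()
        .withRepairFileSize(true)
        .withRemoveDeletedFiles(true)
        .execute(manifest, fileSystem);

    for (Entry entry : repaired) {
      System.out.println(entry.path + " " + entry.recordedSize); // prints "a.parquet 100"
    }
  }
}
```

In a real action the sizes would presumably come from the table's file system and the work would be distributed as a Spark job; the sketch only shows the per-entry decision, which is why these two options can be cheap compared with opening each file to recount rows or recompute metrics.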

This proposal differs from ‘RewriteManifests’, which just rewrites manifests but does not fix them by reading data files. Today, the only way to solve an issue like #1980 would probably be to rewrite all data files.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

RussellSpitzer commented, Apr 26, 2021

Ah yeah, I think we usually write new custom Scans; see https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java#L96-L140 for example. I would hesitate to touch anything around “planFiles” for this and would instead opt for reading the manifest files directly.

chenwyi2 commented, Nov 8, 2022

@KarlManong how did you solve this problem? Is the only way to put a metadata file into HDFS?

Top Results From Across the Web

Spark Iceberg manifest reports wrong parquet file sizes. #1980
We are using spark iceberg and some iceberg manifest files report the wrong data file (parquet) size, it's ~ 2x larger than the...
[GitHub] [iceberg] szehon-ho edited a comment on issue #2435 ...
[GitHub] [iceberg] szehon-ho edited a comment on issue #2435: RepairManifestsAction · GitBox Thu, 08 Apr 2021 08:29:23 -0700. szehon-ho edited a comment on ......
