RepairManifestsAction
A while ago @aokolnychyi suggested a new action, “RepairManifestsAction”, and I wanted to discuss it before starting to work on it.
The initial idea is an action that reads DataFiles and uses them to repair Iceberg table manifests. An example usage would be after a metadata bug fix, such as #1980 (file size incorrect in manifest file).
So one proposal is to have a RepairManifestsAction with the typical APIs, like:
actions.repairManifests().filter(Expression expr).caseSensitive(boolean value)
// filters the data files that this action will read
It would also have the following options, which would run different levels of Spark jobs depending on how much of each data file they read:
withRepairFileSize(boolean value)
// reads file sizes from the FileSystem and rewrites manifests
withRepairRowCount(boolean value)
// opens the file and counts rows, then rewrites manifests with the row count
withRepairMetrics(boolean value)
// opens the file and rewrites the manifest with the latest metrics from the file metadata
withRemoveDeletedFiles(boolean value)
// if any data file referenced by a manifest has been removed, removes it from the manifest
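To make the "different levels" idea concrete, here is a minimal sketch in plain Java of how the four options could map to how much of each data file a job needs to touch. None of this exists in Iceberg: the class `RepairManifestsSketch`, the `ReadLevel` enum, and the mapping itself (in particular, that removing deleted files only needs a file status check) are assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the proposed options; these names are not part of Iceberg.
final class RepairManifestsSketch {

  // How much of each data file a given repair needs to touch (assumed levels).
  enum ReadLevel { FILE_STATUS, FILE_METADATA, FULL_SCAN }

  private boolean repairFileSize;
  private boolean repairRowCount;
  private boolean repairMetrics;
  private boolean removeDeletedFiles;

  RepairManifestsSketch withRepairFileSize(boolean value) {
    this.repairFileSize = value;
    return this;
  }

  RepairManifestsSketch withRepairRowCount(boolean value) {
    this.repairRowCount = value;
    return this;
  }

  RepairManifestsSketch withRepairMetrics(boolean value) {
    this.repairMetrics = value;
    return this;
  }

  RepairManifestsSketch withRemoveDeletedFiles(boolean value) {
    this.removeDeletedFiles = value;
    return this;
  }

  // Collects the read levels implied by the enabled options.
  Set<ReadLevel> requiredReadLevels() {
    Set<ReadLevel> levels = EnumSet.noneOf(ReadLevel.class);
    if (repairFileSize || removeDeletedFiles) {
      levels.add(ReadLevel.FILE_STATUS); // size and existence come from the file system
    }
    if (repairMetrics) {
      levels.add(ReadLevel.FILE_METADATA); // metrics come from the file's own metadata
    }
    if (repairRowCount) {
      levels.add(ReadLevel.FULL_SCAN); // rows are counted by opening the file
    }
    return levels;
  }

  public static void main(String[] args) {
    Set<ReadLevel> levels = new RepairManifestsSketch()
        .withRepairFileSize(true)
        .withRemoveDeletedFiles(true)
        .requiredReadLevels();
    System.out.println(levels); // prints [FILE_STATUS]
  }
}
```

With a mapping like this, an action could presumably pick the cheapest Spark job that covers all enabled options instead of always opening every data file.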
The last one may also be helpful: if a data file is accidentally removed or corrupted, the exception “java.io.FileNotFoundException: No such file or directory” is hit on any query to that Iceberg table that would read that data file. Maybe having a ‘repair manifest-list’ or ‘repair metadata’ action later would also be useful, to handle cases where manifest files themselves are removed.
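As a rough illustration of what the file-size and deleted-file repairs would do per entry, here is a small self-contained sketch. It uses the local file system through `java.nio.file` as a stand-in for the table's storage, and `EntrySketch` and `ManifestRepairSketch` are hypothetical names, not Iceberg classes; a real action would work on Iceberg's manifest entries and rewrite the manifests through Spark.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a manifest entry; a real Iceberg DataFile carries far more.
final class EntrySketch {
  final Path path;
  final long fileSizeInBytes;

  EntrySketch(Path path, long fileSizeInBytes) {
    this.path = path;
    this.fileSizeInBytes = fileSizeInBytes;
  }
}

final class ManifestRepairSketch {

  // Returns repaired entries: missing files are dropped, sizes are re-read from storage.
  static List<EntrySketch> repair(List<EntrySketch> entries) throws IOException {
    List<EntrySketch> repaired = new ArrayList<>();
    for (EntrySketch entry : entries) {
      if (!Files.exists(entry.path)) {
        continue; // the withRemoveDeletedFiles case: drop the dangling reference
      }
      long actualSize = Files.size(entry.path); // the withRepairFileSize case
      repaired.add(new EntrySketch(entry.path, actualSize));
    }
    return repaired;
  }

  public static void main(String[] args) throws IOException {
    Path present = Files.createTempFile("data-file", ".parquet");
    Files.write(present, new byte[] {1, 2, 3});
    Path missing = present.resolveSibling("missing-" + present.getFileName());

    List<EntrySketch> repaired = repair(List.of(
        new EntrySketch(present, 999L), // wrong size recorded in the manifest
        new EntrySketch(missing, 10L))); // file no longer exists

    System.out.println(repaired.size() + " entry kept, size " + repaired.get(0).fileSizeInBytes);
    Files.delete(present);
  }
}
```

Only the surviving entry is kept, with its size corrected to what the file system reports, so queries would no longer fail on the missing file.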
This proposal is a bit different from ‘RewriteManifests’, which just rewrites manifests but does not fix them by reading data files. Today, the only way to solve an issue like #1980 would probably be to rewrite all data files.
Issue Analytics
- Created 2 years ago
- Comments: 16 (8 by maintainers)
Ah yeah, I think we usually do new custom Scans; see https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java#L96-L140 for example. I would hesitate to touch anything around “planFiles” for this and would instead opt for reading the manifest files directly.
@KarlManong how did you solve this problem? Is the only way to put a metadata file into HDFS?