# Make `DeleteOrphanFiles` in Spark reliable
There have been multiple attempts to make our `DeleteOrphanFiles` action more reliable. One such discussion happened more than a year ago, but we never reached consensus. I will try to summarize my current thoughts, and I encourage everyone to comment as well.
## Location Generation
There are three location types in Iceberg.
### Table location
Table locations are either provided by the user or defaulted in `TableOperations`. When defaulting, we currently manipulate raw strings via methods such as `String.format`. That means there is no normalization/validation for root table locations.
### Metadata
Classes that extend `BaseMetastoreTableOperations` use `metadataFileLocation` to generate a new location for all types of metadata files. Under the hood, it simply uses `String.format` and has no location normalization.
```java
private String metadataFileLocation(TableMetadata metadata, String filename) {
  String metadataLocation = metadata.properties().get(TableProperties.WRITE_METADATA_LOCATION);
  if (metadataLocation != null) {
    return String.format("%s/%s", metadataLocation, filename);
  } else {
    return String.format("%s/%s/%s", metadata.location(), METADATA_FOLDER_NAME, filename);
  }
}
```
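To illustrate (with a made-up metadata location), plain `String.format` concatenation happily preserves a trailing slash from the configured property, producing a double slash that nothing ever cleans up:

```java
// Hypothetical value of TableProperties.WRITE_METADATA_LOCATION with a trailing slash.
String metadataLocation = "s3://bucket/warehouse/db/tbl/metadata/";

// Same pattern as metadataFileLocation above: raw string concatenation, no normalization.
String location = String.format("%s/%s", metadataLocation, "v2.metadata.json");
// location -> "s3://bucket/warehouse/db/tbl/metadata//v2.metadata.json" (double slash persists)
```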
In `HadoopTableOperations`, we rely on `Path` instead of `String.format` as we have access to Hadoop classes.
```java
private Path metadataPath(String filename) {
  return new Path(metadataRoot(), filename);
}

private Path metadataRoot() {
  return new Path(location, "metadata");
}
```
That means some normalization is happening for metadata file locations generated in `HadoopTableOperations`.
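For example, a quick sketch (locations made up): the `Path(parent, child)` constructor strips a trailing slash from the parent instead of leaking a double slash into the child location.

```java
import org.apache.hadoop.fs.Path;

// The parent location has a trailing slash, but Path normalizes it away
// when resolving the child, unlike the String.format approach above.
Path root = new Path("hdfs://localhost:8020/warehouse/db/tbl/");
Path metadataDir = new Path(root, "metadata");
// metadataDir.toString() -> "hdfs://localhost:8020/warehouse/db/tbl/metadata"
```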
### Data
Data file locations depend on the `LocationProvider` returned by `TableOperations`. While users can inject a custom location provider, Iceberg has two built-in implementations:

- `DefaultLocationProvider`
- `ObjectStoreLocationProvider`
Both built-in implementations use `String.format` and have no normalization/validation.
## Problem
Right now, `DeleteOrphanFiles` uses the Hadoop `FileSystem` to list all actual files in a given location and compares them to the locations stored in the metadata. As discussed above, Iceberg does not do any normalization for locations persisted in the metadata. That means locations returned during listing may have cosmetic differences compared to locations stored in the metadata, even though both can point to the same files. As a consequence, `DeleteOrphanFiles` can corrupt a table.
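A concrete sketch of the failure mode (all locations made up): the metadata records a location with a double slash, the listing returns the normalized form, a plain string comparison treats them as different files, and the action deletes a file that is actually referenced.

```java
// Location as persisted in table metadata (built via String.format, never normalized).
String inMetadata = "hdfs://localhost:8020/warehouse/db/tbl//data/00000-0-acd3.parquet";

// The same file as returned by FileSystem listing (normalized by Hadoop).
String fromListing = "hdfs://localhost:8020/warehouse/db/tbl/data/00000-0-acd3.parquet";

// Naive comparison: the listed file looks unreferenced, i.e. an "orphan".
boolean looksOrphan = !inMetadata.equals(fromListing); // true -> a referenced file gets deleted
```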
## Proposed Approach
- We cannot change what is already stored in the metadata, so `DeleteOrphanFiles` should normalize the locations of reachable files. Since we do listing via the Hadoop `FileSystem`, we should probably leverage Hadoop classes for normalization to avoid surprises. For example, just constructing a new `Path` from a `String` normalizes the path part of the URI.

  ```java
  Path path = new Path("hdfs://localhost:8020/user//log/data///dummy_file/");
  path.toString(); // hdfs://localhost:8020/user/log/data/dummy_file
  ```
- Normalization is required but does not solve all issues. Since table locations are arbitrary, we may hit a few weird cases:
  - Data or metadata locations without a scheme and authority.
  - Changes in the Hadoop conf. We may have one set of configured file systems when the table was created and a completely different one when deleting orphans. For example, the scheme name can change (it is just a string), the authority can be represented via an IP address instead of a host name, or multiple host names can be mapped to the same name node.
  - I am not sure whether it is possible, but can someone migrate from `s3a` to `s3` or vice versa?
- The action should expose options to ignore the scheme and authority during the comparison. If that happens, only normalized paths will be compared.
- The location we are about to clean must be validated. If the action is configured to take the scheme and authority into account, the provided location must have those set. In other words, it is illegal to provide a location without an authority if the action is supposed to compare authorities.
- Locations persisted in the metadata without a scheme and authority must inherit those values from the location we scan for orphans, not from the current Hadoop conf. This essentially means we will only compare the normalized path for such locations.
When it comes to possible implementations, we can call `mapPartitions` on a `DataFrame` with locations.

```java
Path path = new Path(location); // should normalize the path
URI uri = path.toUri(); // should give us access to scheme, authority, path
... // run validation, inherit scheme and authority, or ignore them if not needed
Path newPath = new Path(newScheme, newAuthority, uri.getPath());
return newPath.toString();
```
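To make that sketch concrete, here is a minimal self-contained version of such a normalization step. It only illustrates the rules proposed above (normalize, validate, inherit, or ignore scheme/authority); the class, method, and parameter names are made up for this sketch and are not an existing Iceberg API.

```java
import java.net.URI;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not an Iceberg class.
final class LocationNormalizer {

  // Normalizes one location from the metadata against the root we scan for orphans.
  static String normalize(String location, URI scanRootUri, boolean compareSchemeAndAuthority) {
    URI uri = new Path(location).toUri(); // Path construction normalizes the path part

    if (!compareSchemeAndAuthority) {
      // Only normalized paths are compared when scheme/authority are ignored.
      return uri.getPath();
    }

    // Validation rule from above: the scan location must carry what we compare.
    if (scanRootUri.getScheme() == null || scanRootUri.getAuthority() == null) {
      throw new IllegalArgumentException(
          "Scan location must define scheme and authority when they are compared: " + scanRootUri);
    }

    // Inherit missing scheme/authority from the scan location, not the Hadoop conf.
    String scheme = uri.getScheme() != null ? uri.getScheme() : scanRootUri.getScheme();
    String authority = uri.getAuthority() != null ? uri.getAuthority() : scanRootUri.getAuthority();
    return new Path(scheme, authority, uri.getPath()).toString();
  }
}
```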
I know @karuppayya has been working on a fix so I wanted to make sure we build consensus first.
cc @karuppayya @RussellSpitzer @rdblue @flyrain @szehon-ho @jackye1995 @pvary @openinx @rymurr
## Top GitHub Comments
@kbendick, I agree it is useful to supply locations instead of relying on listing. I believe there is an open PR that can be merged prior to any work discussed here.
@karuppayya and I spent some time discussing, and I personally think @szehon-ho's idea of having an error mode is quite promising. I'd probably have only `error` and `ignore` modes and combine it with other ideas mentioned on this thread: add a `prefix-mismatch-mode` option whose possible values are `error` (the default) and `ignore`. I like this approach because it will throw an exception if something suspicious happens and will give the user ways to resolve conflicts instead of silently taking some action.
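For illustration, an invocation under this proposal might look like the sketch below. The generic `option` setter does exist on Iceberg actions, but the `prefix-mismatch-mode` name and its values are only what is proposed in this thread, not a released API, and `table` is assumed to be an already-loaded `Table`.

```java
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

DeleteOrphanFiles.Result result =
    SparkActions.get()
        .deleteOrphanFiles(table)
        .option("prefix-mismatch-mode", "error") // proposed option; 'error' fails on suspicious mismatches
        .execute();
```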
The actual algorithm can be like this: split both the actual and reachable file locations into `scheme`, `authority`, and `path` columns, then join on `path` and map partitions. This way, we don't need separate DFs with and without a prefix and can have more sophisticated comparison.
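A rough sketch of that shape in Spark's Java API (everything here is illustrative: `actualFiles` and `reachableFiles` are assumed to be `Dataset<String>`s of locations, and the helper method would live in some enclosing class):

```java
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple3;

// Split a location into (scheme, authority, path); constructing a Hadoop Path
// normalizes the path part before we extract the URI components.
static Dataset<Row> toParts(Dataset<String> locations) {
  return locations
      .map(
          (MapFunction<String, Tuple3<String, String, String>>) location -> {
            URI uri = new Path(location).toUri();
            return new Tuple3<>(uri.getScheme(), uri.getAuthority(), uri.getPath());
          },
          Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
      .toDF("scheme", "authority", "path");
}

// Join actual and reachable files on the normalized path only. Rows that match
// on path but disagree on scheme/authority are exactly the suspicious cases the
// proposed prefix-mismatch-mode would either report or ignore.
Dataset<Row> joined =
    toParts(actualFiles).as("actual")
        .join(toParts(reachableFiles).as("reachable"), "path");
```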
Any thoughts?
@aokolnychyi, that plan sounds great to me. I think that covers all the cases we need to handle.