# Make `DeleteOrphanFiles` in Spark reliable
There have been multiple attempts to make our `DeleteOrphanFiles` action more reliable. One such discussion happened more than a year ago, but we never reached consensus. I will try to summarize my current thoughts, and I encourage everyone to comment as well.
## Location Generation
There are three location types in Iceberg.
### Table location
Table locations are either provided by the user or defaulted in `TableOperations`. When defaulting, we currently manipulate raw strings via methods such as `String.format`. That means there is no normalization/validation for root table locations.
### Metadata
Classes that extend `BaseMetastoreTableOperations` use `metadataFileLocation` to generate a new location for all types of metadata files. Under the hood, it simply uses `String.format` and has no location normalization.
```java
private String metadataFileLocation(TableMetadata metadata, String filename) {
  String metadataLocation = metadata.properties().get(TableProperties.WRITE_METADATA_LOCATION);
  if (metadataLocation != null) {
    return String.format("%s/%s", metadataLocation, filename);
  } else {
    return String.format("%s/%s/%s", metadata.location(), METADATA_FOLDER_NAME, filename);
  }
}
```
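To illustrate (with a made-up metadata location), plain `String.format` concatenation happily preserves a trailing slash from the configured property, producing a double slash that nothing ever cleans up:

```java
// Hypothetical value of TableProperties.WRITE_METADATA_LOCATION with a trailing slash.
String metadataLocation = "s3://bucket/warehouse/db/tbl/metadata/";

// Same pattern as metadataFileLocation above: raw string concatenation, no normalization.
String location = String.format("%s/%s", metadataLocation, "v2.metadata.json");
// location -> "s3://bucket/warehouse/db/tbl/metadata//v2.metadata.json" (double slash persists)
```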
In `HadoopTableOperations`, we rely on `Path` instead of `String.format` as we have access to Hadoop classes.
```java
private Path metadataPath(String filename) {
  return new Path(metadataRoot(), filename);
}

private Path metadataRoot() {
  return new Path(location, "metadata");
}
```
That means some normalization is happening for metadata file locations generated in `HadoopTableOperations`.
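For example, a quick sketch (locations made up): the `Path(parent, child)` constructor strips a trailing slash from the parent instead of leaking a double slash into the child location.

```java
import org.apache.hadoop.fs.Path;

// The parent location has a trailing slash, but Path normalizes it away
// when resolving the child, unlike the String.format approach above.
Path root = new Path("hdfs://localhost:8020/warehouse/db/tbl/");
Path metadataDir = new Path(root, "metadata");
// metadataDir.toString() -> "hdfs://localhost:8020/warehouse/db/tbl/metadata"
```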
### Data
Data file locations depend on the `LocationProvider` returned by `TableOperations`. While users can inject a custom location provider, Iceberg has two built-in implementations:

- `DefaultLocationProvider`
- `ObjectStoreLocationProvider`
Both built-in implementations use `String.format` and have no normalization/validation.
## Problem
Right now, `DeleteOrphanFiles` uses the Hadoop `FileSystem` to list all actual files in a given location and compares them to the locations stored in the metadata. As discussed above, Iceberg does not do any normalization for locations persisted in the metadata. That means locations returned during listing may have cosmetic differences compared to locations stored in the metadata, even though both can point to the same files. As a consequence, `DeleteOrphanFiles` can corrupt a table.
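A concrete sketch of the failure mode (all locations made up): the metadata records a location with a double slash, the listing returns the normalized form, a plain string comparison treats them as different files, and the action deletes a file that is actually referenced.

```java
// Location as persisted in table metadata (built via String.format, never normalized).
String inMetadata = "hdfs://localhost:8020/warehouse/db/tbl//data/00000-0-acd3.parquet";

// The same file as returned by FileSystem listing (normalized by Hadoop).
String fromListing = "hdfs://localhost:8020/warehouse/db/tbl/data/00000-0-acd3.parquet";

// Naive comparison: the listed file looks unreferenced, i.e. an "orphan".
boolean looksOrphan = !inMetadata.equals(fromListing); // true -> a referenced file gets deleted
```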
## Proposed Approach
- We cannot change what is already stored in the metadata, so `DeleteOrphanFiles` should normalize the locations of reachable files. Since we do listing via the Hadoop `FileSystem`, we should probably leverage Hadoop classes for normalization to avoid surprises. For example, just constructing a new `Path` from a `String` normalizes the path part of the URI.

  ```java
  Path path = new Path("hdfs://localhost:8020/user//log/data///dummy_file/");
  path.toString(); // hdfs://localhost:8020/user/log/data/dummy_file
  ```
- Normalization is required but does not solve all issues. Since table locations are arbitrary, we may hit a few weird cases:
  - Data or metadata locations without a scheme and authority.
  - Changes in the Hadoop conf. We may have one set of configured file systems when the table was created and a completely different one when deleting orphans. For example, the scheme name can change (it is just a string), the authority can be represented via an IP address instead of a host name, or multiple host names can be mapped to the same name node.
  - I am not sure whether it is possible, but can someone migrate from `s3a` to `s3` or vice versa?
- The action should expose options to ignore the scheme and authority during the comparison. If that happens, only normalized paths will be compared.
- The location we are about to clean must be validated. If the action is configured to take the scheme and authority into account, the provided location must have those set. In other words, it is illegal to provide a location without an authority if the action is supposed to compare authorities.
- Locations persisted in the metadata without a scheme and authority must inherit those values from the location we scan for orphans, not from the current Hadoop conf. This essentially means we will only compare the normalized path for such locations.
When it comes to possible implementations, we can call `mapPartitions` on a `DataFrame` with locations.

```java
Path path = new Path(location); // should normalize the path
URI uri = path.toUri(); // should give us access to scheme, authority, path
... // run validation, inherit scheme and authority, or ignore them if not needed
Path newPath = new Path(newScheme, newAuthority, uri.getPath());
return newPath.toString();
```
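To make that sketch concrete, here is a minimal self-contained version of such a normalization step. It only illustrates the rules proposed above (normalize, validate, inherit, or ignore scheme/authority); the class, method, and parameter names are made up for this sketch and are not an existing Iceberg API.

```java
import java.net.URI;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not an Iceberg class.
final class LocationNormalizer {

  // Normalizes one location from the metadata against the root we scan for orphans.
  static String normalize(String location, URI scanRootUri, boolean compareSchemeAndAuthority) {
    URI uri = new Path(location).toUri(); // Path construction normalizes the path part

    if (!compareSchemeAndAuthority) {
      // Only normalized paths are compared when scheme/authority are ignored.
      return uri.getPath();
    }

    // Validation rule from above: the scan location must carry what we compare.
    if (scanRootUri.getScheme() == null || scanRootUri.getAuthority() == null) {
      throw new IllegalArgumentException(
          "Scan location must define scheme and authority when they are compared: " + scanRootUri);
    }

    // Inherit missing scheme/authority from the scan location, not the Hadoop conf.
    String scheme = uri.getScheme() != null ? uri.getScheme() : scanRootUri.getScheme();
    String authority = uri.getAuthority() != null ? uri.getAuthority() : scanRootUri.getAuthority();
    return new Path(scheme, authority, uri.getPath()).toString();
  }
}
```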
I know @karuppayya has been working on a fix so I wanted to make sure we build consensus first.
cc @karuppayya @RussellSpitzer @rdblue @flyrain @szehon-ho @jackye1995 @pvary @openinx @rymurr
## Top GitHub Comments
@kbendick, I agree it is useful to supply locations instead of relying on listing. I believe there is an open PR that can be merged prior to any work discussed here.
@karuppayya and I spent some time discussing, and I personally think @szehon-ho's idea of having an error mode is quite promising. I'd probably have only `error` and `ignore` modes and combine it with other ideas mentioned on this thread: add a `prefix-mismatch-mode` option whose possible values are `error` (the default) and `ignore`. I like this approach because it will throw an exception if something suspicious happens and will give the user ways to resolve conflicts instead of silently taking some action.
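For illustration, an invocation under this proposal might look like the sketch below. The generic `option` setter does exist on Iceberg actions, but the `prefix-mismatch-mode` name and its values are only what is proposed in this thread, not a released API, and `table` is assumed to be an already-loaded `Table`.

```java
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

DeleteOrphanFiles.Result result =
    SparkActions.get()
        .deleteOrphanFiles(table)
        .option("prefix-mismatch-mode", "error") // proposed option; 'error' fails on suspicious mismatches
        .execute();
```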
The actual algorithm can be like this: split both the actual and reachable file locations into `scheme`, `authority`, and `path` columns, then join on `path` and map partitions. This way, we don't need separate DFs with and without a prefix and can have more sophisticated comparison.
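A rough sketch of that shape in Spark's Java API (everything here is illustrative: `actualFiles` and `reachableFiles` are assumed to be `Dataset<String>`s of locations, and the helper method would live in some enclosing class):

```java
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import scala.Tuple3;

// Split a location into (scheme, authority, path); constructing a Hadoop Path
// normalizes the path part before we extract the URI components.
static Dataset<Row> toParts(Dataset<String> locations) {
  return locations
      .map(
          (MapFunction<String, Tuple3<String, String, String>>) location -> {
            URI uri = new Path(location).toUri();
            return new Tuple3<>(uri.getScheme(), uri.getAuthority(), uri.getPath());
          },
          Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
      .toDF("scheme", "authority", "path");
}

// Join actual and reachable files on the normalized path only. Rows that match
// on path but disagree on scheme/authority are exactly the suspicious cases the
// proposed prefix-mismatch-mode would either report or ignore.
Dataset<Row> joined =
    toParts(actualFiles).as("actual")
        .join(toParts(reachableFiles).as("reachable"), "path");
```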
Any thoughts?
@aokolnychyi, that plan sounds great to me. I think that covers all the cases we need to handle.