Make DeleteOrphanFiles in Spark reliable


There have been multiple attempts to make our DeleteOrphanFiles action more reliable. One such discussion happened more than a year ago. However, we never reached consensus.

I will try to summarize my current thoughts but I encourage everyone to comment as well.

Location Generation

There are three location types in Iceberg.

Table location

Table locations are either provided by the user or defaulted in TableOperations. When defaulting, we currently manipulate raw strings via methods such as String.format. That means there is no normalization/validation for root table locations.

Metadata

Classes that extend BaseMetastoreTableOperations use metadataFileLocation to generate a new location for all types of metadata files. Under the hood, it simply uses String.format and has no location normalization.

private String metadataFileLocation(TableMetadata metadata, String filename) {
  String metadataLocation = metadata.properties().get(TableProperties.WRITE_METADATA_LOCATION);

  if (metadataLocation != null) {
    return String.format("%s/%s", metadataLocation, filename);
  } else {
    return String.format("%s/%s/%s", metadata.location(), METADATA_FOLDER_NAME, filename);
  }
}
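
To see why the lack of normalization matters, here is a standalone illustration of the String.format pattern above (the values are made up). If the configured WRITE_METADATA_LOCATION happens to end with a slash, the artifact is carried straight into the generated location:

// Illustration only: String.format performs no normalization, so a trailing slash
// in the configured metadata location leaks into the generated file location.
String metadataLocation = "s3://bucket/warehouse/db/tbl/metadata/";
String filename = "00001-1b2c3d4e.metadata.json";
String generated = String.format("%s/%s", metadataLocation, filename);
// -> s3://bucket/warehouse/db/tbl/metadata//00001-1b2c3d4e.metadata.json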

In HadoopTableOperations, we rely on Path instead of String.format, as we have access to Hadoop classes.

private Path metadataPath(String filename) {
  return new Path(metadataRoot(), filename);
}

private Path metadataRoot() {
  return new Path(location, "metadata");
}

That means some normalization is happening for metadata file locations generated in HadoopTableOperations.

Data

Data file locations depend on the LocationProvider returned by TableOperations. While users can inject a custom location provider, Iceberg has two built-in implementations:

  • DefaultLocationProvider
  • ObjectStoreLocationProvider

Both built-in implementations use String.format and have no normalization/validation.
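
As a rough sketch (not the actual provider code), both implementations boil down to string concatenation, with the object store variant additionally injecting a hash component so that files are spread across object store prefixes; the table location itself is taken as-is:

// Rough sketch of the two layouts; the values and the hash component are illustrative only.
String tableLocation = "s3://bucket/warehouse/db/tbl";  // arbitrary, user-controlled
String filename = "part-00000-abc.parquet";

// DefaultLocationProvider-style layout
String defaultLocation = String.format("%s/data/%s", tableLocation, filename);

// ObjectStoreLocationProvider-style layout
String hash = "110ab1fc";
String objectStoreLocation = String.format("%s/data/%s/%s", tableLocation, hash, filename);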

Problem

Right now, DeleteOrphanFiles uses the Hadoop FileSystem API to list all actual files in a given location and compares them to the locations stored in the metadata. As discussed above, Iceberg does not do any normalization for locations persisted in the metadata. That means locations returned during listing may have cosmetic differences compared to locations stored in the metadata, even though both can point to the same files. As a consequence, DeleteOrphanFiles can corrupt a table.

Proposed Approach

  • We cannot change what is already stored in the metadata, so DeleteOrphanFiles should normalize the locations of reachable files. Since we do listing via the Hadoop FileSystem API, we should probably leverage Hadoop classes for normalization to avoid surprises. For example, just constructing a new Path from a String normalizes the path part of the URI.
Path path = new Path("hdfs://localhost:8020/user//log/data///dummy_file/");
path.toString(); // hdfs://localhost:8020/user/log/data/dummy_file
  • Normalization is required but does not solve all issues. Since table locations are arbitrary, we may hit a few weird cases.

    • Data or metadata locations without a scheme and authority.
    • Changes in the Hadoop conf. We may have one set of configured file systems when the table was created and a completely different one when deleting orphans. For example, the scheme name can change (it is just a string), the authority can be represented via an IP address instead of a host name, or multiple host names can be mapped to the same name node.
    • I am not sure whether it is possible, but can someone migrate from s3a to s3, or vice versa?
  • The action should expose options to ignore the scheme and authority during the comparison. If that happens, only normalized paths will be compared.

  • The location we are about to clean must be validated. If the action is configured to take scheme and authority into account, the provided location must have those set. In other words, it is illegal to provide a location without an authority if the action is supposed to compare authorities.

  • Locations persisted in the metadata without a scheme and authority must inherit those values from the location we scan for orphans, not from the current Hadoop conf. This essentially means we will only compare the normalized path for such locations.

When it comes to possible implementations, we can call mapPartitions on a DataFrame of locations.

Path path = new Path(location); // should normalize the path
URI uri = path.toUri(); // should give us access to scheme, authority, path
... // run validation, inherit scheme and authority or ignore them if not needed
Path newPath = new Path(newScheme, newAuthority, uri.getPath());
return newPath.toString();
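
For illustration, a slightly more complete version of that mapping function could look like the sketch below. The names (scanScheme, scanAuthority, ignoreSchemeAndAuthority) are hypothetical and only stand in for whatever options the action ends up exposing:

import java.net.URI;
import org.apache.hadoop.fs.Path;

class LocationNormalizer {
  // Hypothetical helper for the mapPartitions call: normalizes one location and,
  // if it lacks a scheme or authority, inherits them from the scanned location.
  static String normalize(String location, String scanScheme, String scanAuthority,
                          boolean ignoreSchemeAndAuthority) {
    Path path = new Path(location);  // normalizes the path part (duplicate/trailing slashes)
    URI uri = path.toUri();

    if (ignoreSchemeAndAuthority) {
      return uri.getPath();          // compare only the normalized path
    }

    String scheme = uri.getScheme() != null ? uri.getScheme() : scanScheme;
    String authority = uri.getAuthority() != null ? uri.getAuthority() : scanAuthority;
    return new Path(scheme, authority, uri.getPath()).toString();
  }
}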

I know @karuppayya has been working on a fix so I wanted to make sure we build consensus first.

cc @karuppayya @RussellSpitzer @rdblue @flyrain @szehon-ho @jackye1995 @pvary @openinx @rymurr


Top GitHub Comments

2 reactions
aokolnychyi commented, Apr 13, 2022

@kbendick, I agree it is useful to supply locations instead of relying on listing. I believe there is an open PR that can be merged prior to any work discussed here.

@karuppayya and I spent some time discussing this, and I personally think @szehon-ho’s idea of having an error mode is quite promising. I’d probably have only error and ignore modes and combine it with the other ideas mentioned in this thread.

  • Normalize the path part of URIs to avoid cosmetic differences like extra slashes.
  • Introduce prefix-mismatch-mode option. Possible values are error (default) and ignore.
  • Expose ways to influence the comparison. For instance, allow passing equivalent schemes.

I like this approach because it will throw an exception if something suspicious happens and will give the user ways to resolve conflicts instead of silently taking some action.

The actual algorithm could look like this:

  • Build actual file DF
    • Either provided by the user or acquired via listing. If listing, the location must contain a scheme and authority.
  • Build reachable file DF via metadata tables
  • Transform both actual and reachable DFs so that they contain scheme, authority, path columns.
  • Perform LEFT OUTER JOIN on path and map partitions.
| actual_scheme | actual_authority | path | valid_scheme | valid_authority | valid_path | result |
|---------------|------------------|------|--------------|-----------------|------------|--------|
| s3 | bucket1 | p0 | null | null | null | orphan (no match for the normalized path) |
| s3 | bucket1 | p1 | null | null | p1 | not orphan (a null scheme/authority in the metadata matches any scheme/authority) |
| s3 | bucket1 | p2 | s3a | bucket1 | p2 | not orphan (requires defaults for equivalent schemes like s3 and s3a) |
| s3 | bucket1 | p3 | s3a | bucket2 | p3 | error by default; can be ignored, or the user may indicate that bucket1 and bucket2 are different, which makes (s3, bucket1, p3) an orphan |

This way, we don’t need separate DFs with and without prefixes, and we can have a more sophisticated comparison.
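
To make the comparison concrete, the per-row classification after the join could look roughly like the sketch below. The names (equalSchemes, equalAuthorities, Result) are placeholders rather than an existing API; null valid_* columns mean either no match or a metadata location without a scheme/authority:

import java.util.Map;
import java.util.Objects;

class OrphanClassifier {
  enum Result { ORPHAN, NOT_ORPHAN, CONFLICT }

  // Hypothetical per-row check after the LEFT OUTER JOIN on the normalized path.
  // equalSchemes/equalAuthorities map aliases (e.g. s3a -> s3) to a canonical form.
  static Result classify(String actualScheme, String actualAuthority,
                         String validScheme, String validAuthority, String validPath,
                         Map<String, String> equalSchemes, Map<String, String> equalAuthorities) {
    if (validPath == null) {
      return Result.ORPHAN;  // no reachable file with the same normalized path
    }

    boolean schemeMatch = validScheme == null
        || Objects.equals(canonical(actualScheme, equalSchemes), canonical(validScheme, equalSchemes));
    boolean authorityMatch = validAuthority == null
        || Objects.equals(canonical(actualAuthority, equalAuthorities), canonical(validAuthority, equalAuthorities));

    if (schemeMatch && authorityMatch) {
      return Result.NOT_ORPHAN;
    }

    // prefix mismatch: error by default, or ignore / treat as orphan depending on the configured mode
    return Result.CONFLICT;
  }

  private static String canonical(String value, Map<String, String> aliases) {
    return aliases.getOrDefault(value, value);
  }
}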

Any thoughts?

1 reaction
rdblue commented, Apr 13, 2022

@aokolnychyi, that plan sounds great to me. I think it covers all the cases we need to handle.
