ExpireSnapshots deletes active files
Version: 0.10.0
Spark version: 2.4.5
It’s possible to use the API to construct snapshots in such a way that expiring snapshots (with file deletion enabled) causes active data files to be deleted. This happens with an iceberg table that’s manually managed over raw parquet files written by spark (doesn’t really bear going into why). The basic steps are:
1. Create a partitioned iceberg table
2. Write two partitions (p1 and p2) as raw parquet data via spark
3. Append files to iceberg table
4. IMPORTANT Commit iceberg overwrite that:
   - Deletes files appended in step 3
   - Re-adds those same files
5. Expire snapshot 1 with file deletion enabled
6. Write raw parquet data to a new directory containing data for partitions p2 and p3 (note that p2 is the same partition as in step 2)
7. Commit iceberg overwrite that:
   - Deletes files in snapshot 2 from partition p2
   - Adds all new files from step 6
8. Expire snapshot 2 with file deletion enabled
9. Reading the iceberg table now fails because the files from p1, which are still active files, were deleted by the snapshot expiration in step 8
Here’s a script that shows how to reproduce:
// spark-shell --packages org.apache.iceberg:iceberg-spark-runtime:0.10.0
import java.sql.{Date,Timestamp}
import java.time.LocalDate
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.JavaConversions._
import spark.implicits._
import org.apache.iceberg.hadoop.{HadoopTables,HadoopInputFile}
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.parquet.ParquetUtil
import org.apache.iceberg.{PartitionSpec,DataFiles,MetricsConfig}
val tableDir = "hdfs:///tmp/iceberg-table"
val fs = new Path(tableDir).getFileSystem(sc.hadoopConfiguration)
fs.delete(new Path(tableDir), true)
val tables = new HadoopTables(sc.hadoopConfiguration)
// Create simple table partitioned by day
val df1 = (
  spark
    .range(10L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 5L, Timestamp.valueOf("2021-01-19 00:00:00")).otherwise(Timestamp.valueOf("2021-01-20 00:00:00")) as 'ts
    )
)
val schema = SparkSchemaUtil.convert(df1.schema)
val table = tables.create(
  schema,
  PartitionSpec.builderFor(schema).day("ts").build,
  tableDir
)
// Get data files from a path
def getDataFiles(path: String) = {
  fs
    .globStatus(new Path(path, "*/*.parquet"))
    .map({ status => HadoopInputFile.fromStatus(status, sc.hadoopConfiguration) })
    .map({ inputFile =>
      DataFiles
        .builder(table.spec)
        .withInputFile(inputFile)
        .withMetrics(ParquetUtil.fileMetrics(inputFile, MetricsConfig.getDefault))
        .withPartitionPath(new Path(inputFile.location).getParent.getName)
        .build
    })
    .toSeq
}
// Write dataframe as raw parquet
(
  df1
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit1")
)
// Append data files to iceberg table
val dataFiles = getDataFiles(s"$tableDir/data/commit1")
val append = table.newFastAppend
dataFiles.foreach(append.appendFile)
append.commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Issue an overwrite in which the appended datafiles are deleted, then re-added
val overwrite = table.newOverwrite
dataFiles.foreach(overwrite.deleteFile)
dataFiles.foreach(overwrite.addFile)
overwrite.commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Expire first snapshot (append) with file cleanup enabled
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Write new parquet data, with one new (2021-01-21) and one overwritten (2021-01-20) partition
(
  spark
    .range(5L, 15L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 10L, Timestamp.valueOf("2021-01-20 00:00:00")).otherwise(Timestamp.valueOf("2021-01-21 00:00:00")) as 'ts
    )
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit2")
)
// Do an overwrite that deletes the old files from the overwritten partition
// (2021-01-20) and adds the files we just wrote for the overwritten and new
// partitions
val dataFiles2 = getDataFiles(s"$tableDir/data/commit2")
val overwrite2 = table.newOverwrite
dataFiles.filter({ file => LocalDate.ofEpochDay(file.partition.get(0, classOf[Integer]).toLong) == LocalDate.of(2021, 1, 20) }).foreach(overwrite2.deleteFile)
dataFiles2.foreach(overwrite2.addFile)
overwrite2.commit
table.refresh
// Expire the second commit (the first overwrite)
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh
// Throws an exception because the files from the original commit for the
// partition 2021-01-19 have been deleted, even though they were not affected
// by the most recent overwrite
spark.read.format("iceberg").load(tableDir).show
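To see the mismatch directly, the data files the table metadata still references can be compared with what is left on disk. A minimal diagnostic sketch, assuming the path#files metadata-table syntax is available for Hadoop (path-based) tables:

// Data files the current table metadata still references (includes 2021-01-19)
spark.read.format("iceberg").load(s"$tableDir#files").select("file_path").show(false)

// Data files actually left on disk under the first commit directory; the
// 2021-01-19 files are gone after the expiration above
fs.globStatus(new Path(s"$tableDir/data/commit1", "*/*.parquet")).foreach({ status => println(status.getPath) })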
Clearly there's user error here (we shouldn't be deleting and re-adding the same files that were added in the previous snapshot), but it feels like iceberg is doing the wrong thing as well, since it deletes files that it still considers active. It feels like the right solution is one of:
- Reject the commit in step 4 with an exception
- Warn the user that they're trying to both add and delete the same files, and drop the affected files from the delete list (a client-side variation of this is sketched after this list)
- Detect during expiration that the files to be deleted are still referenced by active snapshots, and prevent them from being deleted
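Until something like that lands, a caller can guard against this on their own side. A minimal sketch against the script above, a variation of the second option that drops churned files from both the delete and add sets; safeOverwrite is a hypothetical helper, not an Iceberg API:

def safeOverwrite(deletes: Seq[org.apache.iceberg.DataFile], adds: Seq[org.apache.iceberg.DataFile]): Unit = {
  // A file that the same commit both deletes and re-adds is a net no-op; churning
  // it through delete+add is what sets up the bad file cleanup on expiration
  val churned = deletes.map(_.path.toString).toSet.intersect(adds.map(_.path.toString).toSet)
  if (churned.nonEmpty) {
    println(s"WARNING: ${churned.size} file(s) are both deleted and re-added in this commit; leaving them untouched")
  }
  val realDeletes = deletes.filterNot({ file => churned.contains(file.path.toString) })
  val realAdds = adds.filterNot({ file => churned.contains(file.path.toString) })
  if (realDeletes.nonEmpty || realAdds.nonEmpty) {
    val overwrite = table.newOverwrite
    realDeletes.foreach(overwrite.deleteFile)
    realAdds.foreach(overwrite.addFile)
    overwrite.commit
  }
}

// In step 4 of the repro, safeOverwrite(dataFiles, dataFiles) skips the commit
// entirely instead of deleting and re-adding every file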
There is a method expire which just returns the Dataset of files that would be removed, without removing any files, if you'd like that. But if you just want to remove snapshots without cleaning files, it should be essentially the same as the table API. The big difference between the Table and Action implementations is in how they identify which files need to be cleaned.
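For reference, a rough dry-run sketch based on that description; this assumes the Spark Actions entry point (org.apache.iceberg.actions.Actions) and the expire() preview described above, and ExpireSnapshotsAction may only exist in releases newer than 0.10.0:

// Assumed Actions API usage: preview which files an expiration would remove
import org.apache.iceberg.actions.Actions

val expireAction = Actions.forTable(table).expireSnapshots().expireSnapshotId(table.snapshots.head.snapshotId)

// expire() is described above as returning the Dataset of files that would be
// removed, without deleting anything
expireAction.expire().show(false)

// expireAction.execute() would actually perform the expiration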
I noticed that there's no equivalent in ExpireSnapshotsAction to ExpireSnapshots.cleanExpiredFiles(); does that mean that the former always cleans expired files?