ExpireSnapshots deletes active files

See original GitHub issue

Iceberg version: 0.10.0, Spark version: 2.4.5

It’s possible to use the API to construct snapshots in such a way that expiring snapshots (with file deletion enabled) causes active data files to be deleted. This happens with an Iceberg table that’s manually managed over raw Parquet files written by Spark (the reasons for that setup don’t really matter here). The basic steps are:

  1. Create a partitioned Iceberg table
  2. Write two partitions (p1 and p2) as raw Parquet data via Spark
  3. Append those files to the Iceberg table
  4. IMPORTANT: Commit an Iceberg overwrite that
    1. Deletes the files appended in step 3
    2. Re-adds those same files
  5. Expire snapshot 1 with file deletion enabled
  6. Write raw Parquet data to a new directory containing data for partitions p2 and p3 (note that p2 is the same partition as in step 2)
  7. Commit an Iceberg overwrite that
    1. Deletes the files in snapshot 2 belonging to partition p2
    2. Adds all the new files from step 6
  8. Expire snapshot 2 with file deletion enabled
  9. Reading the Iceberg table now fails because the files from p1, which are still active, were deleted by the snapshot expiration in step 8

Here’s a script that shows how to reproduce:

// spark-shell --packages org.apache.iceberg:iceberg-spark-runtime:0.10.0

import java.sql.{Date,Timestamp}
import java.time.LocalDate
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.JavaConversions._
import spark.implicits._
import org.apache.iceberg.hadoop.{HadoopTables,HadoopInputFile}
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.parquet.ParquetUtil
import org.apache.iceberg.{PartitionSpec,DataFiles,MetricsConfig}

val tableDir = "hdfs:///tmp/iceberg-table"

val fs = new Path(tableDir).getFileSystem(sc.hadoopConfiguration)
fs.delete(new Path(tableDir), true)

val tables = new HadoopTables(sc.hadoopConfiguration)

// Create simple table partitioned by day
val df1 = (
  spark
    .range(10L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 5L, Timestamp.valueOf("2021-01-19 00:00:00")).otherwise(Timestamp.valueOf("2021-01-20 00:00:00")) as 'ts
    )
)

val schema = SparkSchemaUtil.convert(df1.schema)
val table = tables.create(
  schema,
  PartitionSpec.builderFor(schema).day("ts").build,
  tableDir
)

// Get data files from a path
def getDataFiles(path: String) = {
  fs
    .globStatus(new Path(path, "*/*.parquet"))
    .map({ status => HadoopInputFile.fromStatus(status, sc.hadoopConfiguration) })
    .map({ inputFile =>
      DataFiles
        .builder(table.spec)
        .withInputFile(inputFile)
        .withMetrics(ParquetUtil.fileMetrics(inputFile, MetricsConfig.getDefault))
        .withPartitionPath(new Path(inputFile.location).getParent.getName)
        .build
    })
    .toSeq
}

// Write dataframe as raw parquet
(
  df1
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit1")
)

// Append data files to iceberg table
val dataFiles = getDataFiles(s"$tableDir/data/commit1")

val append = table.newFastAppend
dataFiles.foreach(append.appendFile)
append.commit
table.refresh

// Table data appears OK
spark.read.format("iceberg").load(tableDir).show

// Issue an overwrite in which the appended datafiles are deleted, then re-added
val overwrite = table.newOverwrite
dataFiles.foreach(overwrite.deleteFile)
dataFiles.foreach(overwrite.addFile)
overwrite.commit
table.refresh

// Table data appears OK
spark.read.format("iceberg").load(tableDir).show

// Expire first snapshot (append) with file cleanup enabled
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh

// Table data appears OK
spark.read.format("iceberg").load(tableDir).show

// Write new parquet data, with one new (2021-01-21) and one overwritten (2021-01-20) partition
(
  spark
    .range(5L, 15L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 10L, Timestamp.valueOf("2021-01-20 00:00:00")).otherwise(Timestamp.valueOf("2021-01-21 00:00:00")) as 'ts
    )
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit2")
)

// Do an overwrite that deletes the old files from the overwritten partition
// (2021-01-20) and adds the files we just wrote for the overwritten and new
// partitions
//
val dataFiles2 = getDataFiles(s"$tableDir/data/commit2")

val overwrite2 = table.newOverwrite
dataFiles.filter({ file => LocalDate.ofEpochDay(file.partition.get(0, classOf[Integer]).toLong) == LocalDate.of(2021, 1, 20) }).foreach(overwrite2.deleteFile)
dataFiles2.foreach(overwrite2.addFile)
overwrite2.commit
table.refresh

// Expire the second commit (the first overwrite)
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh

// Throws an exception because the files from the original commit for the
// partition 2021-01-19 have been deleted, even though they were not affected
// by the most recent overwrite
spark.read.format("iceberg").load(tableDir).show

Clearly there’s user error here (we shouldn’t be deleting and re-adding the same files that were added in the previous snapshot), but it feels like Iceberg is doing the wrong thing as well, since it deletes files that it still considers active. It feels like the right solution is one of the following (a caller-side guard is sketched after the list):

  1. Reject the commit in step 4 with an exception
  2. Warn the user that they’re trying to both add and delete the same files, and drop the affected files from the delete list
  3. Detect during the expiration that the files to be deleted are still active and prevent them from getting deleted
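
Until Iceberg does one of those, the overlap can be guarded against at the call site. Here’s a minimal sketch under a few assumptions: dataFilesToAdd and dataFilesToDelete are hypothetical names for whatever Seq[DataFile] sets the caller has built, and files are matched by path. Any file that appears in both sets is dropped from both, since deleting and re-adding the same file in one commit is a net no-op; in the step-4 scenario above both sets are identical, so the commit is skipped entirely.

// Hypothetical caller-side guard (not part of the Iceberg API): drop any file that an
// overwrite would both delete and re-add, since that combination is a net no-op.
// dataFilesToAdd / dataFilesToDelete stand in for whatever Seq[DataFile] the caller built.
val addPaths    = dataFilesToAdd.map(_.path.toString).toSet
val deletePaths = dataFilesToDelete.map(_.path.toString).toSet
val overlap     = addPaths intersect deletePaths

val safeAdds    = dataFilesToAdd.filterNot(f => overlap(f.path.toString))
val safeDeletes = dataFilesToDelete.filterNot(f => overlap(f.path.toString))

if (safeAdds.nonEmpty || safeDeletes.nonEmpty) {
  val guarded = table.newOverwrite
  safeDeletes.foreach(guarded.deleteFile)
  safeAdds.foreach(guarded.addFile)
  guarded.commit
  table.refresh
}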

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
RussellSpitzer commented on Jan 21, 2021

There is a method, expire(), which returns the Dataset of files that would be removed without actually removing them, if that’s what you’re after. If you just want to remove snapshots without cleaning up files, it should behave essentially the same as the table API; the big difference between the Table and Action implementations is in how they identify which files need to be cleaned.

/**
 * Expires snapshots and commits the changes to the table, returning a Dataset of files to delete.
 *
 * This does not delete data files. To delete data files, run {@link #execute()}.
 *
 * This may be called before or after {@link #execute()} is called to return the expired file list.
 *
 * @return a Dataset of files that are no longer referenced by the table
 */
public Dataset<Row> expire() {
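
To make that concrete, here’s a rough sketch of previewing an expiration from the same spark-shell session as the reproduction script (so its imports are in scope). It assumes the Spark 2.4 actions entry point that ships with 0.10.0, org.apache.iceberg.actions.Actions; per the javadoc above, expire() commits the expiration and returns the removable files without deleting them, while execute() would also delete them.

// Sketch only, assuming Iceberg 0.10.0's Spark actions entry point.
import org.apache.iceberg.actions.Actions

val expireAction = Actions
  .forTable(spark, table)
  .expireSnapshots()
  .expireSnapshotId(table.snapshots.head.snapshotId)

// Commits the snapshot expiration and returns the files that would be removed,
// but does not delete them.
val removableFiles = expireAction.expire()
removableFiles.show(20, false)

// expireAction.execute() would additionally delete those files.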
0 reactions
thesquelched commented on Jan 21, 2021

I noticed that there’s no equivalent in ExpireSnapshotsAction to ExpireSnapshots.cleanExpiredFiles(); does that mean that the former always cleans expired files?

Read more comments on GitHub >

