ExpireSnapshots deletes active files
Version: 0.10.0
Spark version: 2.4.5
It’s possible to use the API to construct snapshots in such a way that expiring snapshots (with file deletion enabled) causes active data files to be deleted. This happens with an iceberg table that’s manually managed over raw parquet files written by spark (doesn’t really bear going into why). The basic steps are:
1. Create a partitioned iceberg table
2. Write two partitions (p1 and p2) as raw parquet data via spark
3. Append files to iceberg table
4. IMPORTANT Commit iceberg overwrite that:
   - Deletes files appended in step 3
   - Re-adds those same files
5. Expire snapshot 1 with file deletion enabled
6. Write raw parquet data to a new directory containing data for partitions p2 and p3 (note that p2 is the same partition as in step 2)
7. Commit iceberg overwrite that:
   - Deletes files in snapshot 2 from partition p2
   - Adds all new files from step 6
8. Expire snapshot 2 with file deletion enabled
9. Reading the iceberg table now fails because the files from p1, which are still active files, were deleted by the snapshot expiration in step 8
Here’s a script that shows how to reproduce:
// spark-shell --packages org.apache.iceberg:iceberg-spark-runtime:0.10.0
import java.sql.{Date,Timestamp}
import java.time.LocalDate
import org.apache.hadoop.fs._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import scala.collection.JavaConversions._
import spark.implicits._
import org.apache.iceberg.hadoop.{HadoopTables,HadoopInputFile}
import org.apache.iceberg.spark.SparkSchemaUtil
import org.apache.iceberg.parquet.ParquetUtil
import org.apache.iceberg.{PartitionSpec,DataFiles,MetricsConfig}
val tableDir = "hdfs:///tmp/iceberg-table"
val fs = new Path(tableDir).getFileSystem(sc.hadoopConfiguration)
fs.delete(new Path(tableDir), true)
val tables = new HadoopTables(sc.hadoopConfiguration)
// Create simple table partitioned by day
val df1 = (
  spark
    .range(10L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 5L, Timestamp.valueOf("2021-01-19 00:00:00")).otherwise(Timestamp.valueOf("2021-01-20 00:00:00")) as 'ts
    )
)
val schema = SparkSchemaUtil.convert(df1.schema)
val table = tables.create(
  schema,
  PartitionSpec.builderFor(schema).day("ts").build,
  tableDir
)
// Get data files from a path
def getDataFiles(path: String) = {
  fs
    .globStatus(new Path(path, "*/*.parquet"))
    .map({ status => HadoopInputFile.fromStatus(status, sc.hadoopConfiguration) })
    .map({ inputFile =>
      DataFiles
        .builder(table.spec)
        .withInputFile(inputFile)
        .withMetrics(ParquetUtil.fileMetrics(inputFile, MetricsConfig.getDefault))
        .withPartitionPath(new Path(inputFile.location).getParent.getName)
        .build
    })
    .toSeq
}
// Write dataframe as raw parquet
(
  df1
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit1")
)
// Append data files to iceberg table
val dataFiles = getDataFiles(s"$tableDir/data/commit1")
val append = table.newFastAppend
dataFiles.foreach(append.appendFile)
append.commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Issue an overwrite in which the appended datafiles are deleted, then re-added
val overwrite = table.newOverwrite
dataFiles.foreach(overwrite.deleteFile)
dataFiles.foreach(overwrite.addFile)
overwrite.commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Expire first snapshot (append) with file cleanup enabled
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh
// Table data appears OK
spark.read.format("iceberg").load(tableDir).show
// Write new parquet data, with one new (2021-01-21) and one overwritten (2021-01-20) partition
(
  spark
    .range(5L, 15L)
    .select(
      'id,
      concat(lit("value"), 'id) as 'value,
      when('id < 10L, Timestamp.valueOf("2021-01-20 00:00:00")).otherwise(Timestamp.valueOf("2021-01-21 00:00:00")) as 'ts
    )
    .withColumn("ts_day", date_format('ts, "yyyy-MM-dd"))
    .repartition(2)
    .sortWithinPartitions('ts)
    .write
    .partitionBy("ts_day")
    .mode("overwrite")
    .parquet(s"$tableDir/data/commit2")
)
// Do an overwrite that deletes the old files from the overwritten partition
// (2021-01-20) and adds the files we just wrote for the overwritten and new
// partitions
val dataFiles2 = getDataFiles(s"$tableDir/data/commit2")
val overwrite2 = table.newOverwrite
dataFiles.filter({ file => LocalDate.ofEpochDay(file.partition.get(0, classOf[Integer]).toLong) == LocalDate.of(2021, 1, 20) }).foreach(overwrite2.deleteFile)
dataFiles2.foreach(overwrite2.addFile)
overwrite2.commit
table.refresh
// Expire the second commit (the first overwrite)
table.expireSnapshots.expireSnapshotId(table.snapshots.head.snapshotId).cleanExpiredFiles(true).commit
table.refresh
// Throws an exception because the files from the original commit for the
// partition 2021-01-19 have been deleted, even though they were not affected
// by the most recent overwrite
spark.read.format("iceberg").load(tableDir).show
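To see the mismatch directly, the data files the table metadata still references can be compared with what is left on disk. A minimal diagnostic sketch, assuming the path#files metadata-table syntax is available for Hadoop (path-based) tables:

// Data files the current table metadata still references (includes 2021-01-19)
spark.read.format("iceberg").load(s"$tableDir#files").select("file_path").show(false)

// Data files actually left on disk under the first commit directory; the
// 2021-01-19 files are gone after the expiration above
fs.globStatus(new Path(s"$tableDir/data/commit1", "*/*.parquet")).foreach({ status => println(status.getPath) })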
Clearly there's user error here (we shouldn't be deleting and re-adding the same files that were added in the previous snapshot), but it feels like iceberg is doing the wrong thing as well, since it deletes files that it still considers active. It feels like the right solution is one of:
- Reject the commit in step 4 with an exception
- Warn the user that they're trying to both add and delete the same files, and drop the affected files from the delete list (a client-side variation of this is sketched after this list)
- Detect during expiration that the files to be deleted are still referenced by active snapshots, and prevent them from being deleted
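Until something like that lands, a caller can guard against this on their own side. A minimal sketch against the script above, a variation of the second option that drops churned files from both the delete and add sets; safeOverwrite is a hypothetical helper, not an Iceberg API:

def safeOverwrite(deletes: Seq[org.apache.iceberg.DataFile], adds: Seq[org.apache.iceberg.DataFile]): Unit = {
  // A file that the same commit both deletes and re-adds is a net no-op; churning
  // it through delete+add is what sets up the bad file cleanup on expiration
  val churned = deletes.map(_.path.toString).toSet.intersect(adds.map(_.path.toString).toSet)
  if (churned.nonEmpty) {
    println(s"WARNING: ${churned.size} file(s) are both deleted and re-added in this commit; leaving them untouched")
  }
  val realDeletes = deletes.filterNot({ file => churned.contains(file.path.toString) })
  val realAdds = adds.filterNot({ file => churned.contains(file.path.toString) })
  if (realDeletes.nonEmpty || realAdds.nonEmpty) {
    val overwrite = table.newOverwrite
    realDeletes.foreach(overwrite.deleteFile)
    realAdds.foreach(overwrite.addFile)
    overwrite.commit
  }
}

// In step 4 of the repro, safeOverwrite(dataFiles, dataFiles) skips the commit
// entirely instead of deleting and re-adding every file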
There is a method expire which just returns the Dataset of files that would be removed, without removing any files, if you'd like that. But if you just want to remove snapshots without cleaning files, it should be essentially the same as the table API. The big difference between the Table and Action implementations is in how they identify which files need to be cleaned.
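For reference, a rough dry-run sketch based on that description; this assumes the Spark Actions entry point (org.apache.iceberg.actions.Actions) and the expire() preview described above, and ExpireSnapshotsAction may only exist in releases newer than 0.10.0:

// Assumed Actions API usage: preview which files an expiration would remove
import org.apache.iceberg.actions.Actions

val expireAction = Actions.forTable(table).expireSnapshots().expireSnapshotId(table.snapshots.head.snapshotId)

// expire() is described above as returning the Dataset of files that would be
// removed, without deleting anything
expireAction.expire().show(false)

// expireAction.execute() would actually perform the expiration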
I noticed that there's no equivalent in ExpireSnapshotsAction to ExpireSnapshots.cleanExpiredFiles(); does that mean that the former always cleans expired files?