Iceberg v2 table cannot expire delete files after rewrite data files action
See original GitHub issue.
I'm using a v2-format Iceberg table. When I use Spark 3.2 to rewrite the Iceberg data files (additionally with a `where` clause) and then run an expire statement to expire old delete files, only the old small data files are deleted; the equality delete files and position delete files are not deleted and remain in the filesystem. The rewrite data files SQL is:
CALL hive_prod.system.rewrite_data_files(table => 'test.mock_pre_dwv'
, where => 'dt >= "2022-06-04" '
, options => map (
'delete-file-threshold','1'
,'min-input-files','1'
,'partial-progress.enabled','true'
,'max-concurrent-file-group-rewrites','20'
)
);
The expire snapshots SQL is:
CALL hive_prod.system.expire_snapshots(table => 'test.mock_pre_dwv', older_than => timestamp '2022-06-08 11:31:49',retain_last => 1) ;
Result of the Spark expire action:
{
  "deleted_data_files_count": 5,
  "deleted_position_delete_files_count": 0,
  "deleted_equality_delete_files_count": 0,
  "deleted_manifest_files_count": 588,
  "deleted_manifest_lists_count": 319
}
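This result is consistent with how snapshot expiration works: expire_snapshots only removes files that are unreachable from every retained snapshot. After a rewrite, the current snapshot's manifests can still reference the old delete files, so they survive expiration even though they no longer match any live rows. A minimal sketch of that reachability rule in plain Python (not the Iceberg implementation; snapshot and file names are hypothetical):

```python
# Sketch: expire_snapshots deletes only files reachable solely from
# expired snapshots. Files still referenced by a retained snapshot stay.

def expire(snapshots, retained):
    """Return the set of files that are safe to physically delete."""
    keep = set()
    for snap in retained:
        keep |= snapshots[snap]
    expired_files = set()
    for snap, files in snapshots.items():
        if snap not in retained:
            expired_files |= files
    return expired_files - keep

# Hypothetical history: s1 wrote small data files plus delete files;
# s2 is the rewrite commit that replaced the data files but whose
# manifests still reference the old delete files.
snapshots = {
    "s1": {"data-1.parquet", "data-2.parquet",
           "eq-delete-1.parquet", "pos-delete-1.parquet"},
    "s2": {"data-compacted.parquet",
           "eq-delete-1.parquet", "pos-delete-1.parquet"},
}
removed = expire(snapshots, retained={"s2"})
print(sorted(removed))  # ['data-1.parquet', 'data-2.parquet']
```

Only the rewritten small data files become unreachable; the delete files stay because the retained snapshot still carries them, matching the zero delete-file counts in the result above.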
Issue Analytics
- Created: a year ago
- Comments: 7
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think PR #3990 won't work; it does too much in the commit phase, and we need a Spark action instead.
I think the way to do this is to write a Spark job that finds 'dangling' delete files (that is, delete files that don't point to any live data file). Once this PR is in: https://github.com/apache/iceberg/pull/4812, we can implement the new Spark action 'removeDanglingDeleteFile'. (This Spark action was mentioned in the original delete file design doc https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg/edit#heading=h.fxypqdd7zxcj, but with no implementation details; I'm thinking this could be one way to do it.)
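The core check such an action needs follows from the v2 spec's sequence-number rules: an equality delete file applies to data files with a strictly smaller data sequence number, while a position delete file applies to data files with an equal or smaller one. A delete file is dangling when no live data file can satisfy its rule. A hedged sketch of that check in plain Python (field names and structures are illustrative, not Iceberg's actual API; a real action would also match position deletes by referenced file path):

```python
# Sketch of 'find dangling delete files' using v2 sequence-number rules.
# equality delete (seq N) applies to data files with seq <  N
# position delete (seq N) applies to data files with seq <= N

def find_dangling(data_files, delete_files):
    """Return paths of delete files that no live data file can match."""
    dangling = []
    for d in delete_files:
        if d["content"] == "equality":
            applies = any(f["seq"] < d["seq"] for f in data_files)
        else:  # position delete
            applies = any(f["seq"] <= d["seq"] for f in data_files)
        if not applies:
            dangling.append(d["path"])
    return dangling

# Hypothetical state after compaction: the only live data file was
# written by the rewrite commit, so its sequence number is higher
# than both old delete files'.
data_files = [{"path": "data-compacted.parquet", "seq": 5}]
delete_files = [
    {"path": "eq-delete-1.parquet", "content": "equality", "seq": 3},
    {"path": "pos-delete-1.parquet", "content": "position", "seq": 3},
]
print(find_dangling(data_files, delete_files))
# ['eq-delete-1.parquet', 'pos-delete-1.parquet']
```

In this scenario both delete files come out dangling, which is exactly the state the reporter hit: after the rewrite they can never apply to any live data file, yet nothing removes them from the table's metadata.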