
iceberg v2 table cannot expire delete files after rewrite datafile action

See original GitHub issue

I'm using a v2 format Iceberg table. When I use Spark 3.2 to rewrite Iceberg data files (additionally adding a where clause), and then use the expire snapshots statement to expire old delete files, I see that only the old small data files are deleted; the equality delete files and position delete files are not deleted and still remain in the filesystem. The rewrite data files SQL is:

CALL hive_prod.system.rewrite_data_files(table => 'test.mock_pre_dwv'
  , where => 'dt >= "2022-06-04" '
  , options => map(
      'delete-file-threshold', '1'
      , 'min-input-files', '1'
      , 'partial-progress.enabled', 'true'
      , 'max-concurrent-file-group-rewrites', '20'
      )
);

The expire snapshots SQL is:

CALL hive_prod.system.expire_snapshots(table => 'test.mock_pre_dwv', older_than => timestamp '2022-06-08 11:31:49', retain_last => 1);

The Spark expire action execution result is:

{
  "deleted_data_files_count": 5,
  "deleted_position_delete_files_count": 0,
  "deleted_equality_delete_files_count": 0,
  "deleted_manifest_files_count": 588,
  "deleted_manifest_lists_count": 319
}
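One way to confirm that the delete files are still tracked by the table (rather than merely orphaned on disk) is to query Iceberg's files metadata table. This is a minimal sketch, assuming the hive_prod catalog and test.mock_pre_dwv table from the report and an Iceberg version that exposes the content column on that metadata table:

-- Minimal sketch: list delete files still referenced by the current table metadata.
-- content: 0 = data file, 1 = position delete file, 2 = equality delete file.
SELECT content, file_path, record_count
FROM hive_prod.test.mock_pre_dwv.files
WHERE content != 0;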

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7

Top GitHub Comments

1 reaction
szehon-ho commented, Jun 22, 2022

I think PR #3990 won't work; it is too much to do in the commit phase, and we need a Spark action.

0 reactions
szehon-ho commented, Jul 7, 2022

I think the way to do this is to write a Spark job that finds 'dangling' delete files (that is, delete files that don't point to any live data file). I think once this PR is in: https://github.com/apache/iceberg/pull/4812, we can implement the new Spark action 'removeDanglingDeleteFile'. (This Spark action was mentioned in the original delete file design doc https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg/edit#heading=h.fxypqdd7zxcj but with no implementation details; I am thinking this could be one way to do it.)
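For illustration, a check along those lines could be sketched in Spark SQL against the entries metadata table. This is only a rough sketch, not the proposed removeDanglingDeleteFile action: it assumes the hive_prod catalog and test.mock_pre_dwv table from the report, that status 2 in the entries table marks a deleted entry, and that an equality delete only applies to data files with a strictly smaller sequence number in the same partition:

-- Rough sketch: approximate "dangling" equality delete files as delete files
-- that no longer apply to any live data file in the same partition.
WITH live AS (
  SELECT sequence_number,
         data_file.content   AS content,   -- 0 = data, 1 = position deletes, 2 = equality deletes
         data_file.partition AS part,
         data_file.file_path AS path
  FROM hive_prod.test.mock_pre_dwv.entries
  WHERE status != 2                        -- keep only live (added/existing) entries
)
SELECT d.path AS dangling_equality_delete_file
FROM live d
LEFT JOIN live f
  ON  f.content = 0                        -- candidate live data files
  AND f.part = d.part
  AND f.sequence_number < d.sequence_number
WHERE d.content = 2                        -- equality delete files
  AND f.path IS NULL;                      -- no live data file the delete still applies to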

Read more comments on GitHub >

Top Results From Across the Web

Optimizing Iceberg tables - Amazon Athena
The OPTIMIZE table REWRITE DATA compaction action rewrites data files into a more optimized layout based on their size and number of associated...
Read more >
Maintenance - Apache Iceberg
Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small....
Read more >
Maintaining Iceberg Tables - Compaction, Expiring Snapshots ...
Note that any manifest lists, manifests, and data files associated with an expired snapshot will be deleted when you delete a snapshot -...
Read more >
Getting Started With Apache Iceberg - DZone Refcardz
Write-audit-publish (WAP) is a pattern where data is written to a table but is not initially committed. Then validation logic occurs — if...
Read more >
Table Maintenance: The Key To Keeping Your Iceberg Tables ...
Creating a Sample Table · Rewriting Data Files · Expiring Snapshots · Removing Orphan Files · Rewriting Manifest Files.
Read more >
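For reference, the maintenance operations listed in the last result above map to Iceberg's standard Spark procedures. A minimal sketch, reusing the hive_prod catalog and test.mock_pre_dwv table from the report (the timestamp is just an example cutoff):

-- Routine Iceberg table maintenance via Spark procedures.
CALL hive_prod.system.rewrite_data_files(table => 'test.mock_pre_dwv');
CALL hive_prod.system.expire_snapshots(table => 'test.mock_pre_dwv', older_than => timestamp '2022-06-08 11:31:49');
CALL hive_prod.system.remove_orphan_files(table => 'test.mock_pre_dwv');
CALL hive_prod.system.rewrite_manifests('test.mock_pre_dwv');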
