Spark: Spark SQL CALL procedures block (expire_snapshots and remove_orphan_files)
See original GitHub issue.
Spark version: 3.2.1 or 3.1.2
Spark SQL statements:
CALL hive_prod.system.expire_snapshots(table => 'test.iceberg_test_col_data_with_dt_02', older_than => TIMESTAMP '2022-05-18 19:02:00.595', retain_last => 1);
CALL hive_prod.system.remove_orphan_files(table => 'test.iceberg_test_col_data_with_dt_02', dry_run => true);
Executing either statement can cause a Spark task to block (the last task of the collectAsList operator).
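For context, these procedures are typically run from the spark-sql shell with the Iceberg SQL extensions enabled. The following is a hedged configuration sketch, not taken from the issue: the metastore URI, catalog settings, and the Iceberg runtime version (0.13.2 for Spark 3.2) are illustrative and must be adjusted to the actual environment.

```shell
# Illustrative only: launch spark-sql with a Hive-backed Iceberg catalog
# named hive_prod and run the expire_snapshots procedure from the issue.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_prod.type=hive \
  --conf spark.sql.catalog.hive_prod.uri=thrift://metastore-host:9083 \
  -e "CALL hive_prod.system.expire_snapshots(table => 'test.iceberg_test_col_data_with_dt_02', older_than => TIMESTAMP '2022-05-18 19:02:00.595', retain_last => 1)"
```

Running the same statement with `spark.master=local[*]` instead of YARN is what reproduces the hang reported below.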
Issue Analytics
- State:
- Created a year ago
- Comments:13 (1 by maintainers)
Top Results From Across the Web
spark action expireSnapshots and removeOrphanFiles block ...
The spark procedure has to wait until the computation of the files to actually be deleted. Currently, the results of that computation is ......
Read more >
Spark Procedures - Apache Iceberg
This procedure will remove old snapshots and data files which are uniquely required by those old snapshots. This means the expire_snapshots procedure will...
Read more >
Deep Dive into Iceberg SQL Extensions - Dremio
This talk will focus on the Iceberg SQL extensions, a recent development in the Iceberg community to efficiently manage tables through SQL. In ......
Read more >
Apache Spark with Apache Iceberg - a way to boost your data ...
Apache Iceberg provides two methods for spark users to interact with ... To avoid this, simply remove orphan files using Java or Scala...
Read more >
Getting Started With Apache Iceberg - DZone Refcardz
Hive enabled any developer with SQL skills to write jobs for processing ... Commonly performed operations are provided by Iceberg as Spark procedure...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I solved this problem: running in local mode blocks, while Spark on YARN works fine. To fix the blocking in local mode there are two options: modify the Spark source, or modify the Iceberg source. Modifying Spark is the more general fix for this issue: in org.apache.spark.scheduler.TaskSetManager#addPendingTask, comment out the section that handles Spark task priority and the blocking goes away.
I've added you. From my earlier investigation, this also seems related to delete-file manifests: during manifest rewrite and snapshot expiration, delete files whose sequence number is lower than the sequence number of the latest data file in the partition are not removed either. I've looked at that PR as well; a senior contributor said it is a fairly large change. After applying the patch, have you kept it running for a while?
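The sequence-number condition the comment describes can be sketched as follows. This is a simplified, conservative illustration of when a delete file becomes dead weight and could be removed; it is not Iceberg's actual implementation, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int

@dataclass(frozen=True)
class DeleteFile:
    path: str
    sequence_number: int

def droppable_delete_files(delete_files: List[DeleteFile],
                           live_data_files: List[DataFile]) -> List[DeleteFile]:
    """Return delete files that no longer apply to any live data file.

    Simplified, conservative rule: a delete file can only affect data files
    written at or before its own sequence number, so once every remaining
    live data file has a strictly higher sequence number... the delete file
    matches nothing and can be dropped during snapshot expiration.
    """
    if not live_data_files:
        # No live data files left: every delete file is obsolete.
        return list(delete_files)
    min_data_seq = min(f.sequence_number for f in live_data_files)
    # Keep any delete file whose sequence number could still reach a
    # live data file; drop only those strictly below the minimum.
    return [d for d in delete_files if d.sequence_number < min_data_seq]
```

The comment in the issue suggests expiration was keeping delete files even when this condition held, which inflates metadata and slows the very planning step that `expire_snapshots` blocks on.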