Spark: Spark SQL CALL procedures block (expire_snapshots and remove_orphan_files)
See original GitHub issue.
Spark version: 3.2.1 or 3.1.2
Spark SQL statements:
CALL hive_prod.system.expire_snapshots(table => 'test.iceberg_test_col_data_with_dt_02', older_than => TIMESTAMP '2022-05-18 19:02:00.595', retain_last => 1);
CALL hive_prod.system.remove_orphan_files(table => 'test.iceberg_test_col_data_with_dt_02', dry_run => true);
Executing either statement can cause a Spark task to block (the last task of the collectAsList operator).
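For context, these procedures are typically run from the spark-sql shell with the Iceberg SQL extensions enabled. The following is a hedged configuration sketch, not taken from the issue: the metastore URI, catalog settings, and the Iceberg runtime version (0.13.2 for Spark 3.2) are illustrative and must be adjusted to the actual environment.

```shell
# Illustrative only: launch spark-sql with a Hive-backed Iceberg catalog
# named hive_prod and run the expire_snapshots procedure from the issue.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_prod.type=hive \
  --conf spark.sql.catalog.hive_prod.uri=thrift://metastore-host:9083 \
  -e "CALL hive_prod.system.expire_snapshots(table => 'test.iceberg_test_col_data_with_dt_02', older_than => TIMESTAMP '2022-05-18 19:02:00.595', retain_last => 1)"
```

Running the same statement with `spark.master=local[*]` instead of YARN is what reproduces the hang reported below.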
Issue Analytics
- State:
- Created a year ago
- Comments:13 (1 by maintainers)
Top Results From Across the Web
spark action expireSnapshots and removeOrphanFiles block ...
The spark procedure has to wait until the computation of the files to actually be deleted. Currently, the results of that computation is ......
Read more >
Spark Procedures - Apache Iceberg
This procedure will remove old snapshots and data files which are uniquely required by those old snapshots. This means the expire_snapshots procedure will...
Read more >
Deep Dive into Iceberg SQL Extensions - Dremio
This talk will focus on the Iceberg SQL extensions, a recent development in the Iceberg community to efficiently manage tables through SQL. In ......
Read more >
Apache Spark with Apache Iceberg - a way to boost your data ...
Apache Iceberg provides two methods for spark users to interact with ... To avoid this, simply remove orphan files using Java or Scala...
Read more >
Getting Started With Apache Iceberg - DZone Refcardz
Hive enabled any developer with SQL skills to write jobs for processing ... Commonly performed operations are provided by Iceberg as Spark procedure...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I solved this problem: running in local mode blocks, while Spark on YARN works fine. To fix the blocking in local mode there are two options: modify the Spark source, or modify the Iceberg source. Modifying Spark is the more general fix for this issue: in org.apache.spark.scheduler.TaskSetManager#addPendingTask, comment out the section that handles Spark task priority and the blocking goes away.
I've added you. From my earlier investigation, this also seems related to delete-file manifests: during manifest rewrite and snapshot expiration, delete files whose sequence number is lower than the sequence number of the latest data file in the partition are not removed either. I've looked at that PR as well; a senior contributor said it is a fairly large change. After applying the patch, have you kept it running for a while?
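The sequence-number condition the comment describes can be sketched as follows. This is a simplified, conservative illustration of when a delete file becomes dead weight and could be removed; it is not Iceberg's actual implementation, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DataFile:
    path: str
    sequence_number: int

@dataclass(frozen=True)
class DeleteFile:
    path: str
    sequence_number: int

def droppable_delete_files(delete_files: List[DeleteFile],
                           live_data_files: List[DataFile]) -> List[DeleteFile]:
    """Return delete files that no longer apply to any live data file.

    Simplified, conservative rule: a delete file can only affect data files
    written at or before its own sequence number, so once every remaining
    live data file has a strictly higher sequence number... the delete file
    matches nothing and can be dropped during snapshot expiration.
    """
    if not live_data_files:
        # No live data files left: every delete file is obsolete.
        return list(delete_files)
    min_data_seq = min(f.sequence_number for f in live_data_files)
    # Keep any delete file whose sequence number could still reach a
    # live data file; drop only those strictly below the minimum.
    return [d for d in delete_files if d.sequence_number < min_data_seq]
```

The comment in the issue suggests expiration was keeping delete files even when this condition held, which inflates metadata and slows the very planning step that `expire_snapshots` blocks on.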