Spark: SparkSQL call procedures blocking (expire_snapshots and delete orphan files)


Spark version: 3.2.1 or 3.1.2

Spark SQL statements:

CALL hive_prod.system.expire_snapshots(table => 'test.iceberg_test_col_data_with_dt_02', older_than => timestamp '2022-05-18 19:02:00.595', retain_last => 1);
CALL hive_prod.system.remove_orphan_files(table => 'test.iceberg_test_col_data_with_dt_02', dry_run => true);

Executing these statements may cause a Spark task to block (the last task of the collectAsList operator).
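For anyone reproducing this from a script, here is a minimal sketch of how the two CALL statements could be assembled with named arguments. The helper names `sql_literal` and `build_call` are hypothetical (they are not part of Spark or Iceberg); only the procedure names, the argument names and the values come from the issue text.

```python
from datetime import datetime


def sql_literal(value):
    """Render a Python value as a Spark SQL literal (illustrative, not exhaustive)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, datetime):
        # Millisecond precision, matching the timestamp literal in the issue.
        return "timestamp '" + value.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + "'"
    text = str(value)
    # Refuse characters that would need escaping instead of guessing the rules.
    if "'" in text or "\\" in text:
        raise ValueError("unsupported character in string literal: " + text)
    return "'" + text + "'"


def build_call(catalog, procedure, **named_args):
    """Build a CALL statement for <catalog>.system.<procedure> with named arguments."""
    args = ", ".join(f"{name} => {sql_literal(value)}" for name, value in named_args.items())
    return f"CALL {catalog}.system.{procedure}({args})"


TABLE = "test.iceberg_test_col_data_with_dt_02"

expire_sql = build_call(
    "hive_prod",
    "expire_snapshots",
    table=TABLE,
    older_than=datetime(2022, 5, 18, 19, 2, 0, 595000),
    retain_last=1,
)

orphan_sql = build_call("hive_prod", "remove_orphan_files", table=TABLE, dry_run=True)

print(expire_sql)
print(orphan_sql)
```

Each resulting string can then be passed to `spark.sql(...)` on a session that has the Iceberg catalog and SQL extensions configured; that setup is assumed here and not shown.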

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 13 (1 by maintainers)

Top GitHub Comments

1 reaction
eric666666 commented, Jun 15, 2022

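I ran into this problem too: both snapshot expiration and orphan file deletion stall on the last task, with a response time of about 10 hours.

I have solved this problem. Running in local mode blocks, while Spark on YARN works normally. If you need to fix the blocking in local mode, there are two options: modify the Spark source code, or modify the Iceberg source code. Modifying the Spark source is the more general fix for this blocking problem. Change the method org.apache.spark.scheduler.TaskSetManager#addPendingTask and comment out the section that handles Spark task priority.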
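As a lighter-weight precaution than patching Spark, a job could check which master it runs on before issuing the maintenance calls. The sketch below is only an illustration of that idea, based on the report above that local mode blocks while YARN does not. The function names are hypothetical, and treating only `local` and `local[...]` master URLs as local mode is an assumption.

```python
import re

# Matches "local", "local[4]", "local[*]" and "local[4,2]".
# Other master URLs (yarn, spark://..., local-cluster[...]) are not treated as local mode here.
_LOCAL_MASTER = re.compile(r"^local(\[[^\]]*\])?$")


def is_local_master(master: str) -> bool:
    """Return True if the Spark master URL denotes plain local mode."""
    return bool(_LOCAL_MASTER.match(master.strip()))


def maintenance_advice(master: str) -> str:
    """Return a short message on whether the reported blocking may apply."""
    if is_local_master(master):
        return "local mode: expire_snapshots / remove_orphan_files were reported to block here"
    return "non-local master: no blocking was reported for this mode"


for url in ("local[*]", "yarn"):
    print(url, "->", maintenance_advice(url))
```

In PySpark the master URL of a running session is available as `spark.sparkContext.master`, so the check could be made just before calling the procedures.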

0 reactions
eric666666 commented, Jun 27, 2022

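I've added you as a contact. When I investigated earlier, it seemed to be related to the manifests of delete files: during rewrites and snapshot expiration, delete files whose sequence number is lower than the sequence number of the latest data files in the partition were still not removed. I also looked at this PR; one experienced contributor said its changes are fairly large. Have you kept it running for a while after applying the patch?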


Top Results From Across the Web

spark action expireSnapshots and removeOrphanFiles block ...
The spark procedure has to wait until the computation of the files to actually be deleted. Currently, the results of that computation is ......

Spark Procedures - Apache Iceberg
This procedure will remove old snapshots and data files which are uniquely required by those old snapshots. This means the expire_snapshots procedure will...

Deep Dive into Iceberg SQL Extensions - Dremio
This talk will focus on the Iceberg SQL extensions, a recent development in the Iceberg community to efficiently manage tables through SQL. In ......

Apache Spark with Apache Iceberg - a way to boost your data ...
Apache Iceberg provides two methods for spark users to interact with ... To avoid this, simply remove orphan files using Java or Scala...

Getting Started With Apache Iceberg - DZone Refcardz
Hive enabled any developer with SQL skills to write jobs for processing ... Commonly performed operations are provided by Iceberg as Spark procedure...
