
Spark: Iceberg Data Source does not support struct literal predicates

See original GitHub issue

From the discussion in https://github.com/apache/iceberg/pull/5113 with @huaxingao, I found this behavior:

For an Iceberg table:

select * from table where table.struct_field = struct(10)

org.apache.spark.sql.AnalysisException: cannot resolve '(table.struct_field = struct(10))' due to data type mismatch: differing types in '(table.struct_field = struct(1))' (struct<nested:int> and struct<col1:int>).; line 1 pos 39;

select * from table where table.struct_field in (struct(10))

java.lang.IllegalArgumentException: Cannot create expression literal from org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema: [1]
  at org.apache.iceberg.expressions.Literals.from(Literals.java:87)
  at org.apache.iceberg.expressions.UnboundPredicate.<init>(UnboundPredicate.java:40)
  at org.apache.iceberg.expressions.Expressions.equal(Expressions.java:175)
  at org.apache.iceberg.spark.SparkFilters.handleEqual(SparkFilters.java:239)
  at org.apache.iceberg.spark.SparkFilters.convert(SparkFilters.java:152)
  at org.apache.iceberg.spark.source.SparkScanBuilder.pushFilters(SparkScanBuilder.java:106)
  at org.apache.spark.sql.execution.datasources.v2.PushDownUtils$.pushFilters(PushDownUtils.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.applyOrElse(V2ScanRelationPushDown.scala:60)
  at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.applyOrElse(V2ScanRelationPushDown.scala:47)
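
For reference, a minimal way to reproduce this in spark-shell might look like the sketch below. The catalog name local, the namespace db, and the table name test_struct_iceberg are placeholders, not taken from the original report; it assumes spark-shell was started with the Iceberg runtime and an Iceberg catalog registered as local.

// Hypothetical repro; catalog and table names are illustrative only.
spark.sql(
  """CREATE TABLE local.db.test_struct_iceberg (
    |  struct_field STRUCT<nested: INT>
    |) USING iceberg""".stripMargin)

spark.sql("INSERT INTO local.db.test_struct_iceberg SELECT named_struct('nested', 10)")

// Fails analysis: struct(10) produces a struct<col1:int> literal, which `=`
// refuses to compare against struct<nested:int>.
spark.sql("SELECT * FROM local.db.test_struct_iceberg WHERE struct_field = struct(10)").show()

// Passes analysis but crashes during filter pushdown, when SparkFilters tries to
// turn the struct literal (a GenericRowWithSchema) into an Iceberg Literal.
spark.sql("SELECT * FROM local.db.test_struct_iceberg WHERE struct_field IN (struct(10))").show()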

For a non-Iceberg table:

spark.sql("select * from test_struct_non_iceberg where struct_field in(struct(10))").show
+------------+
|struct_field|
+------------+
|        {10}|
+------------+


scala> spark.sql("select * from test_struct_non_iceberg where struct_field = struct(10)").show
+------------+
|struct_field|
+------------+
|        {10}|
+------------+
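
The report does not show how test_struct_non_iceberg was created; one plausible setup (the Parquet format and the CTAS are assumptions) is a table whose struct field keeps the default name col1 that struct(10) produces, which is why the = comparison resolves there:

// Illustrative setup for the non-Iceberg comparison table.
spark.sql(
  """CREATE TABLE test_struct_non_iceberg USING parquet AS
    |SELECT struct(10) AS struct_field""".stripMargin)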

Iceberg cannot handle complex predicate filters (it does not collect metrics for anything other than primitive columns), so maybe we should not even push down these filters in SparkScanBuilder. There may also be other problems (for example, the returned schema not matching).
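
As a rough sketch of that direction (illustrative Scala, not the actual Iceberg sources or the eventual fix), pushFilters could treat any filter that SparkFilters fails to convert as "not pushed" and hand it back to Spark for post-scan evaluation, rather than letting the exception escape:

import org.apache.spark.sql.sources.Filter
import org.apache.iceberg.expressions.Expression
import org.apache.iceberg.spark.SparkFilters

// Hypothetical helper: split Spark filters into expressions Iceberg can push
// and filters Spark must still evaluate after the scan.
def partitionPushable(filters: Array[Filter]): (Seq[Expression], Seq[Filter]) = {
  val pushed = Seq.newBuilder[Expression]
  val postScan = Seq.newBuilder[Filter]
  filters.foreach { filter =>
    try {
      val expr = SparkFilters.convert(filter)
      if (expr != null) {
        pushed += expr // convertible: safe to push down
      } else {
        postScan += filter // unrecognized filter type: evaluate in Spark
      }
    } catch {
      // e.g. "Cannot create expression literal from ...GenericRowWithSchema"
      case _: IllegalArgumentException => postScan += filter
    }
  }
  (pushed.result(), postScan.result())
}

Catching the conversion failure (rather than special-casing struct literals) would also cover any other literal type that Iceberg expressions cannot represent.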

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

4 reactions
huaxingao commented, Jun 26, 2022

You are right. We shouldn't push down the complex predicate filters. I think we should catch the IllegalArgumentException here, and then the complex predicate filter won't be pushed, similar to this behavior.

I will open a PR to fix this.

0 reactions
szehon-ho commented, Jul 6, 2022

Closed by #5204; we can work on the optimizations later.


Top Results From Across the Web

Spark Queries - Apache Iceberg
Spark Queries. To use Iceberg in Spark, first configure Spark catalogs. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog ...

Iceberg Table Spec
Evolution – Tables will support full schema and partition spec evolution. ... Reads will be planned using predicates on data values, not partition...

Java API | Apache Iceberg
The main purpose of the Iceberg API is to manage table metadata, like schema, partition spec, metadata, and data files that store table...

Spark Writes - Apache Iceberg
Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support...

Query Apache Iceberg tables | BigQuery - Google Cloud
Apache Iceberg is an open source table format that supports petabyte scale data tables. The Iceberg open specification lets you run multiple ...
