
Spark: Iceberg Data Source does not support struct literal predicates

See original GitHub issue

From the discussion in https://github.com/apache/iceberg/pull/5113 with @huaxingao, I found this behavior:

For an Iceberg table:

select * from table where table.struct_field = struct(10)

org.apache.spark.sql.AnalysisException: cannot resolve '(table.struct_field = struct(10))' due to data type mismatch: differing types in '(table.struct_field = struct(1))' (struct<nested:int> and struct<col1:int>).; line 1 pos 39;

select * from table where table.struct_field in (struct(10))

java.lang.IllegalArgumentException: Cannot create expression literal from org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema: [1]
  at org.apache.iceberg.expressions.Literals.from(Literals.java:87)
  at org.apache.iceberg.expressions.UnboundPredicate.<init>(UnboundPredicate.java:40)
  at org.apache.iceberg.expressions.Expressions.equal(Expressions.java:175)
  at org.apache.iceberg.spark.SparkFilters.handleEqual(SparkFilters.java:239)
  at org.apache.iceberg.spark.SparkFilters.convert(SparkFilters.java:152)
  at org.apache.iceberg.spark.source.SparkScanBuilder.pushFilters(SparkScanBuilder.java:106)
  at org.apache.spark.sql.execution.datasources.v2.PushDownUtils$.pushFilters(PushDownUtils.scala:69)
  at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.applyOrElse(V2ScanRelationPushDown.scala:60)
  at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$pushDownFilters$1.applyOrElse(V2ScanRelationPushDown.scala:47)
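
For reference, a minimal way to reproduce this in spark-shell might look like the sketch below. The catalog name local, the namespace db, and the table name test_struct_iceberg are placeholders, not taken from the original report; it assumes spark-shell was started with the Iceberg runtime and an Iceberg catalog registered as local.

// Hypothetical repro; catalog and table names are illustrative only.
spark.sql(
  """CREATE TABLE local.db.test_struct_iceberg (
    |  struct_field STRUCT<nested: INT>
    |) USING iceberg""".stripMargin)

spark.sql("INSERT INTO local.db.test_struct_iceberg SELECT named_struct('nested', 10)")

// Fails analysis: struct(10) produces a struct<col1:int> literal, which `=`
// refuses to compare against struct<nested:int>.
spark.sql("SELECT * FROM local.db.test_struct_iceberg WHERE struct_field = struct(10)").show()

// Passes analysis but crashes during filter pushdown, when SparkFilters tries to
// turn the struct literal (a GenericRowWithSchema) into an Iceberg Literal.
spark.sql("SELECT * FROM local.db.test_struct_iceberg WHERE struct_field IN (struct(10))").show()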

For a non-Iceberg table:

spark.sql("select * from test_struct_non_iceberg where struct_field in(struct(10))").show
+------------+
|struct_field|
+------------+
|        {10}|
+------------+


scala> spark.sql("select * from test_struct_non_iceberg where struct_field = struct(10)").show
+------------+
|struct_field|
+------------+
|        {10}|
+------------+
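
The report does not show how test_struct_non_iceberg was created; one plausible setup (the Parquet format and the CTAS are assumptions) is a table whose struct field keeps the default name col1 that struct(10) produces, which is why the = comparison resolves there:

// Illustrative setup for the non-Iceberg comparison table.
spark.sql(
  """CREATE TABLE test_struct_non_iceberg USING parquet AS
    |SELECT struct(10) AS struct_field""".stripMargin)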

Iceberg cannot handle complex predicate filters (it does not collect metrics for anything other than primitive columns), so maybe we should not even push down these filters in SparkScanBuilder. There may also be other problems (for example, the returned schema not matching).
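
As a rough sketch of that direction (illustrative Scala, not the actual Iceberg sources or the eventual fix), pushFilters could treat any filter that SparkFilters fails to convert as "not pushed" and hand it back to Spark for post-scan evaluation, rather than letting the exception escape:

import org.apache.spark.sql.sources.Filter
import org.apache.iceberg.expressions.Expression
import org.apache.iceberg.spark.SparkFilters

// Hypothetical helper: split Spark filters into expressions Iceberg can push
// and filters Spark must still evaluate after the scan.
def partitionPushable(filters: Array[Filter]): (Seq[Expression], Seq[Filter]) = {
  val pushed = Seq.newBuilder[Expression]
  val postScan = Seq.newBuilder[Filter]
  filters.foreach { filter =>
    try {
      val expr = SparkFilters.convert(filter)
      if (expr != null) {
        pushed += expr // convertible: safe to push down
      } else {
        postScan += filter // unrecognized filter type: evaluate in Spark
      }
    } catch {
      // e.g. "Cannot create expression literal from ...GenericRowWithSchema"
      case _: IllegalArgumentException => postScan += filter
    }
  }
  (pushed.result(), postScan.result())
}

Catching the conversion failure (rather than special-casing struct literals) would also cover any other literal type that Iceberg expressions cannot represent.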

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

4 reactions
huaxingao commented, Jun 26, 2022

You are right. We shouldn't push down the complex predicate filters. I think we should catch the IllegalArgumentException here, and then the complex predicate filter won't be pushed, similar to this behavior.

I will open a PR to fix this.

0 reactions
szehon-ho commented, Jul 6, 2022

Closed by #5204; we can work on the optimizations later.


Top Results From Across the Web

Spark Queries - Apache Iceberg
Spark Queries. To use Iceberg in Spark, first configure Spark catalogs. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog ...

Iceberg Table Spec
Evolution – Tables will support full schema and partition spec evolution. ... Reads will be planned using predicates on data values, not partition...

Java API | Apache Iceberg
The main purpose of the Iceberg API is to manage table metadata, like schema, partition spec, metadata, and data files that store table...

Spark Writes - Apache Iceberg
Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support...

Query Apache Iceberg tables | BigQuery - Google Cloud
Apache Iceberg is an open source table format that supports petabyte scale data tables. The Iceberg open specification lets you run multiple ...
