
Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator


Hey, we are seeing Spark 3.2 query’s partition filter broken when there is a UDF inside. For example,

select * from tbl where dt between date_add('2022-01-01', -1) and '2022-01-01'

The query plan shows

...
spark_catalog.db.tbl[filters=dt IS NOT NULL, dt <= '2022-01-01']
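For context, the plan above can be reproduced with EXPLAIN; this is a sketch using the table and column names (tbl, dt) from the query in the issue:

```sql
-- Show the formatted plan so the BatchScan's pushed filters are visible.
-- With the function-derived lower bound, only the plain string comparison
-- (dt <= '2022-01-01') reaches the scan, as the snippet above shows.
EXPLAIN FORMATTED
SELECT * FROM tbl
WHERE dt BETWEEN date_add('2022-01-01', -1) AND '2022-01-01';
```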

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 30 (15 by maintainers)

Top GitHub Comments

2 reactions
RussellSpitzer commented, Jun 13, 2022

Oh, I’m silly; my coworker actually did all of this work.

Here is the first PR @sunchao submitted a long time ago 😃.

https://github.com/apache/spark/commit/3d08084022a4365966526216a616a3b760450884

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala

There is actually a lot of documentation in that rule explaining when it gets applied and the difficulties of doing it generically.
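As a rough sketch of what that rule does (my paraphrase of the scaladoc linked above; the table t and INT column c are hypothetical):

```sql
-- c is an INT column; comparing it with a BIGINT literal makes Spark
-- insert a cast on the column side, which blocks pushdown:
SELECT * FROM t WHERE cast(c AS bigint) > 10L;

-- When the literal fits in the column's type, UnwrapCastInBinaryComparison
-- moves the conversion to the literal side, leaving the column bare:
SELECT * FROM t WHERE c > 10;
```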

2 reactions
RussellSpitzer commented, Jun 13, 2022

@puchengy The example you pasted shows there is a function applied to the column; see:

((cast(dt#119 as date) >= 2022-06-05)
 AND (dt#119 <= 2022-06-06))

In general this is a problem for Datasources, although in 3.2 (or 3.3?) a rule was added to force-cast the literal side of the predicate, but I think that’s still limited to certain types of literals. Basically, the issue is that if Spark thinks a function like cast must be applied to a DataSource column, then it is not allowed to push down that predicate. It can only push down predicates where the columns are not modified in any way before being compared to a literal.
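In other words (a sketch using the dt column from the issue, where dt is a string partition column):

```sql
-- Not pushable: the partition column is wrapped in a cast
SELECT * FROM tbl WHERE cast(dt AS date) >= date'2021-12-31';

-- Pushable: the cast sits on the literal side, so dt stays bare
SELECT * FROM tbl WHERE dt >= cast(date'2021-12-31' AS string);
```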

I wrote about this a long time ago here

The issue in your particular query is that dt is a string and the output of your function is a date. Spark needs to resolve this mismatch, so it casts dt to a date so the types match. The types now match, but it is impossible to push down the predicate. To fix it, you cast the literal side to string before Spark has a chance to cast dt. You’ll notice that in your example the other predicate is always pushed down correctly, because it is a literal string being compared to a string column.

This has been the case in Spark for external datasources for a very long time so I don’t think it’s a new issue.

In this case you fix it by casting the “literal” part of the predicate to match the column type.

select * from tbl where dt between cast(date_add('2022-01-01', -1) as string) and '2022-01-01'

