
Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator


Hey, we are seeing Spark 3.2 query’s partition filter broken when there is a UDF inside. For example,

select * from tbl where dt between date_add('2022-01-01', -1) and '2022-01-01'

The query plan shows

...
spark_catalog.db.tbl[filters=dt IS NOT NULL, dt <= '2022-01-01']
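For context, the plan above can be reproduced with EXPLAIN; this is a sketch using the table and column names (tbl, dt) from the query in the issue:

```sql
-- Show the formatted plan so the BatchScan's pushed filters are visible.
-- With the function-derived lower bound, only the plain string comparison
-- (dt <= '2022-01-01') reaches the scan, as the snippet above shows.
EXPLAIN FORMATTED
SELECT * FROM tbl
WHERE dt BETWEEN date_add('2022-01-01', -1) AND '2022-01-01';
```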

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 30 (15 by maintainers)

Top GitHub Comments

2 reactions
RussellSpitzer commented, Jun 13, 2022

Oh, I’m silly; my coworker actually did all of this work.

Here is the first PR @sunchao submitted a long time ago 😃.

https://github.com/apache/spark/commit/3d08084022a4365966526216a616a3b760450884

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala

There is actually a lot of documentation in that rule explaining when it gets applied and the difficulties of doing it generically.
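As a rough sketch of what that rule does (my paraphrase of the scaladoc linked above; the table t and INT column c are hypothetical):

```sql
-- c is an INT column; comparing it with a BIGINT literal makes Spark
-- insert a cast on the column side, which blocks pushdown:
SELECT * FROM t WHERE cast(c AS bigint) > 10L;

-- When the literal fits in the column's type, UnwrapCastInBinaryComparison
-- moves the conversion to the literal side, leaving the column bare:
SELECT * FROM t WHERE c > 10;
```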

2 reactions
RussellSpitzer commented, Jun 13, 2022

@puchengy The example you pasted shows there is a function applied to the column; see:

((cast(dt#119 as date) >= 2022-06-05)
 AND (dt#119 <= 2022-06-06))

In general this is a problem for Datasources, although in 3.2 (or 3.3?) a rule was added to force-cast the literal side of the predicate, but I think that’s still limited to certain types of literals. Basically, the issue is that if Spark thinks a function like cast must be applied to a DataSource column, then it is not allowed to push down that predicate. It can only push down predicates where the columns are not modified in any way before being compared to a literal.
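In other words (a sketch using the dt column from the issue, where dt is a string partition column):

```sql
-- Not pushable: the partition column is wrapped in a cast
SELECT * FROM tbl WHERE cast(dt AS date) >= date'2021-12-31';

-- Pushable: the cast sits on the literal side, so dt stays bare
SELECT * FROM tbl WHERE dt >= cast(date'2021-12-31' AS string);
```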

I wrote about this a long time ago here

The issue in your particular query is that dt is a string and the output of your function is a date. Spark needs to resolve this mismatch, so it casts dt to a date so the types match. The types now match, but it is impossible to push down the predicate. To fix it, you cast the literal side to string before Spark has a chance to cast dt. You’ll notice that in your example the other predicate is always pushed down correctly, because it is a literal string being compared to a string column.

This has been the case in Spark for external datasources for a very long time so I don’t think it’s a new issue.

In this case you fix it by casting the “literal” part of the predicate to match the column type.

select * from tbl where dt between cast(date_add('2022-01-01', -1) as string) and '2022-01-01'

