Pushdown doesn't work with nested fields
Elasticsearch & ES-Hadoop 5.0.0-beta1, Spark 2.0.0
curl -XPOST localhost:9200/pushdown/pushdown -d '{"a":"b","c":{"d":"e"}}'

df = sqlContext.read.format("org.elasticsearch.spark.sql").load("pushdown/pushdown")

df.filter(df.a == "b").show()

as expected, generates:

{"query":{"bool":{"must":[{"match_all":{}}],"filter":[{"exists":{"field":"a"}},{"match":{"a":"b"}}]}}}

df.filter(df.c.d == "e").show()

doesn't generate any pushdown:

{"query":{"match_all":{}}}
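A possible workaround sketch while pushdown is missing for nested fields: build the same bool/filter query the connector emits for top-level fields, but with the dotted path, and pass it at load time through the connector's documented `es.query` option so Elasticsearch filters server-side. The helper name `nested_field_filter` is hypothetical, not part of es-hadoop:

```python
import json

def nested_field_filter(field: str, value: str) -> str:
    """Build the bool/filter query shape shown above, but for a dotted
    (nested) field path such as "c.d"."""
    query = {
        "query": {
            "bool": {
                "must": [{"match_all": {}}],
                "filter": [
                    {"exists": {"field": field}},
                    {"match": {field: value}},
                ],
            }
        }
    }
    return json.dumps(query)

# The resulting string could then be supplied when loading the DataFrame, e.g.:
#   sqlContext.read.format("org.elasticsearch.spark.sql") \
#       .option("es.query", nested_field_filter("c.d", "e")) \
#       .load("pushdown/pushdown")
print(nested_field_filter("c.d", "e"))
```

This moves the filtering into Elasticsearch manually; Spark will still apply its own (now redundant) filter on the returned rows.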
Issue Analytics
- State:
- Created: 7 years ago
- Reactions: 3
- Comments: 11 (5 by maintainers)
Top Results From Across the Web
- What's new in Apache Spark 3.0 - predicate pushdown ...: "Let's see how the predicate pushdown for the nested fields works in Apache Spark 3.0. Below you can find the code and the..."
- [#SPARK-17636] Parquet predicate pushdown for nested fields: "There's a PushedFilters for a simple numeric field, but not for a numeric field inside a struct. Not sure if this is a..."
- Querying Parquet file nested column scan whole column even ...: "SPARK-17636 shows that nested field involved in the predicate, will not trigger push down. What I experience is that even when other..."
- Apache Spark 3 and predicate pushdown for nested fields: "Pushdown predicate for nested fields. Check the blog post 'What's new in Apache Spark 3.0 - predicate pushdown support for nested fields'..."
- Faster Queries on Nested Data - Trino: "The work for this improvement is being tracked in this issue. Similar to Hive Connector, connector-level dereference pushdown can be extended to..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I believe this also happens for column/field pruning, and it looks like a complicated issue. With the latest version (6.4.0), I see the root field is sent, but never the full nested field path.
If we look at the source code: https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-20/src/main/scala/org/elasticsearch/spark/sql/DefaultSource.scala#L233 this plugin forwards the columns/fields given by Spark to the search query builder, so the problem is not there.
Also, it looks like the Spark team has just done this for the Parquet file format: https://github.com/apache/spark/pull/21320/files
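The pruning behaviour described in this comment can be illustrated with a small sketch (hypothetical helper, not the connector's actual code): Spark hands the connector root column names, so when only `c.d` is needed, the whole `c` object is still requested from Elasticsearch's `_source` filter.

```python
def source_includes(required_columns):
    """Translate the columns Spark requests into an Elasticsearch
    _source include list. As observed in 6.4.0, only root fields
    arrive here (e.g. "c", never "c.d"), so the entire nested
    object is fetched even when one sub-field is used."""
    return {"_source": {"includes": sorted(set(required_columns))}}

# Spark asks for the root columns "a" and "c", not "c.d":
print(source_includes(["a", "c"]))
```

The fix would require Spark to pass dotted paths down to the source, which is what the Parquet-only PR above adds for the file-based code path.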
If I’m reading this correctly, Spark does not allow predicates on nested fields to be pushed down to non-Hadoop backends in DSv1: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L110. If that’s the case, we need #1801 before we can do this.
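The DSv1 restriction the comment points at can be mimicked in a few lines (an illustrative sketch, not Spark's actual code): before filters ever reach a non-file-source connector, Spark keeps only predicates whose attribute is a plain top-level column, so a dotted nested-field reference like `c.d` is dropped and the connector receives nothing to push down.

```python
def pushable_filters(filters):
    """Keep only filters on top-level attributes, mirroring the
    behaviour when nested predicate pushdown is not supported:
    any attribute containing a dot (a nested field path) is
    excluded from the set handed to the data source."""
    return [(attr, value) for attr, value in filters if "." not in attr]

# Only the top-level filter survives; ("c.d", "e") is silently dropped,
# which matches the {"query":{"match_all":{}}} seen in the repro.
print(pushable_filters([("a", "b"), ("c.d", "e")]))
```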