Pushdown doesn't work with nested fields


Elasticsearch & ES-Hadoop 5.0.0-beta1, Spark 2.0.0

curl -XPOST localhost:9200/pushdown/pushdown -d '{"a":"b","c":{"d":"e"}}'
df = sqlContext.read.format("org.elasticsearch.spark.sql").load("pushdown/pushdown")
df.filter(df.a == "b").show()

as expected generates:

{"query":{"bool":{"must":[{"match_all":{}}],"filter":[{"exists":{"field":"a"}},{"match":{"a":"b"}}]}}}

df.filter(df.c.d == "e").show()

doesn’t generate any pushdown:

{"query":{"match_all":{}}}
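
For comparison, here is a rough sketch of what the pushed-down query for the nested filter might look like if it followed the same shape as the one generated for `a`. This is an assumption made by analogy with the query above, not output observed from the connector: `equality_pushdown` is a hypothetical helper (not part of ES-Hadoop), and addressing the field by the dotted path `c.d` assumes `c` is mapped as a plain object field.

```python
import json


def equality_pushdown(field, value):
    """Build a query shaped like the one logged for df.a == "b".

    Hypothetical helper for illustration only; this is not ES-Hadoop code.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match_all": {}}],
                "filter": [
                    {"exists": {"field": field}},
                    {"match": {field: value}},
                ],
            }
        }
    }


# Top-level field: same shape as the query shown in the issue.
top_level = json.dumps(equality_pushdown("a", "b"), separators=(",", ":"))

# Nested field addressed by its dotted path (assumed, not observed).
nested = json.dumps(equality_pushdown("c.d", "e"), separators=(",", ":"))

print(top_level)
print(nested)
```

The only difference between the two calls is the field name, which is why the missing pushdown for `c.d` is surprising from the user's side.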

Issue Analytics

  • State: open
  • Created 7 years ago
  • Reactions: 3
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

2 reactions
ebuildy commented, Sep 5, 2018

I believe this also happens for column/field pruning, and it looks like a complicated issue. With the latest version (6.4.0), I see that the root field is sent, but never the whole nested field.

If we look at the source code: https://github.com/elastic/elasticsearch-hadoop/blob/master/spark/sql-20/src/main/scala/org/elasticsearch/spark/sql/DefaultSource.scala#L233 this plugin “forwards” the column/field given by Spark to the search query builder, so the problem is not here.

Also, it looks like the Spark team has just done this for the Parquet file format: https://github.com/apache/spark/pull/21320/files

0 reactions
masseyke commented, Jan 4, 2022

If I’m reading this correctly, Spark does not allow predicates for nested fields to be pushed down for non-hadoop backends in DSv1: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L110. If that’s the case, we need #1801 before we can do this.
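
Until nested predicates are pushed down, one possible workaround is to hand the filter to Elasticsearch directly through the connector's `es.query` setting, so that the nested condition is evaluated on the Elasticsearch side regardless of what Spark pushes. The sketch below is only an illustration: `nested_match_query` and `load_filtered` are hypothetical helper names, and using a `match` query on the dotted path `c.d` is an assumption about how the field is mapped.

```python
import json


def nested_match_query(path, value):
    """Return a query DSL string matching `value` at the dotted field `path`."""
    return json.dumps({"query": {"match": {path: value}}})


def load_filtered(sql_context, resource, path, value):
    """Read `resource`, passing the filter through the es.query option.

    `sql_context` is expected to be a Spark SQLContext (or SparkSession).
    """
    return (
        sql_context.read.format("org.elasticsearch.spark.sql")
        .option("es.query", nested_match_query(path, value))
        .load(resource)
    )


# Usage with the sqlContext from the issue (needs a running cluster):
# df = load_filtered(sqlContext, "pushdown/pushdown", "c.d", "e")
# df.show()
```

This moves the filter out of the DataFrame API and into the reader configuration, so it has to be kept in sync by hand with any later changes to the query.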


