
[PROPOSAL]: Index nested fields


Problem description

This design proposal is for adding support for nested fields in both indexedColumns and includedColumns.

Currently, Hyperspace does not support nested fields like structs or structured arrays.

Being able to use nested fields with Hyperspace would benefit many workloads, since nested schemas are common in real-world datasets.

Goal

The user should be able to create an index with nested fields and use it in queries.

For example:

hs.createIndex(
  df, 
  IndexConfig(
    "idx_nested", 
    indexedColumns = Seq("nested.nst.field1"), 
    includedColumns = Seq("id", "name", "nested.nst.field2")
  )
)

Proposed solution

Using the same Hyperspace APIs, indexes over nested fields should work just as they do for simple top-level fields.

Given the nestedDataset dataset with schema

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- nested: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- nst: struct (nullable = true)
 |    |    |-- field1: string (nullable = true)
 |    |    |-- field2: string (nullable = true)

and the following data

+---+-----+-----------------+
|id |name |nested           |
+---+-----+-----------------+
|2  |name2|[va2, [wa2, wb2]]|
|1  |name1|[va1, [wa1, wb1]]|
+---+-----+-----------------+

This is how an index on the field1 field, with two additional projection columns, can be created:

hs.createIndex(
  nestedDataset, 
  IndexConfig(
    "idx_nested", 
    indexedColumns = Seq("nested.nst.field1"), 
    includedColumns = Seq("id", "name", "nested.nst.field2")))

Adding the support for nested fields impacts the following areas:

  • Validate nested column names
  • Modify the create index action
  • Modify the filter and rule index functions
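The first item, validating nested column names, amounts to resolving each dotted path against the source schema. A minimal sketch using a toy schema type (the names are hypothetical; the real code would walk Spark's StructType):

```scala
// Sketch of the "validate nested column names" step, using a toy schema
// representation instead of Spark's StructType. All names here are
// hypothetical, for illustration only.
sealed trait FieldType
case object Leaf extends FieldType
final case class Struct(fields: Map[String, FieldType]) extends FieldType

// Resolve a dotted column name like "nested.nst.field1" one segment at a
// time; every intermediate segment must be a struct.
def resolves(schema: Struct, column: String): Boolean =
  column.split('.').foldLeft(Option(schema: FieldType)) {
    case (Some(Struct(fields)), segment) => fields.get(segment)
    case _                               => None
  }.isDefined

// The schema from the example below:
val exampleSchema = Struct(Map(
  "id"   -> Leaf,
  "name" -> Leaf,
  "nested" -> Struct(Map(
    "field1" -> Leaf,
    "nst"    -> Struct(Map("field1" -> Leaf, "field2" -> Leaf))))))
```

With this sketch, `resolves(exampleSchema, "nested.nst.field1")` holds, while a path through a leaf such as `"id.field1"` or a missing field such as `"nested.nst.missing"` is rejected.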

Creating the index

Given the dataset defined above with the listed data, after doing

hs.createIndex(
  df, 
  IndexConfig(
    "idx_nested", 
    indexedColumns = Seq("nested.nst.field1"), 
    includedColumns = Seq("id", "name", "nested.nst.field2")))

the following dataset will be created

root
 |-- nested__nst__field1: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- nested__nst__field2: string (nullable = true)
+-------------------+---+-----+-------------------+
|nested__nst__field1| id| name|nested__nst__field2|
+-------------------+---+-----+-------------------+
|                wa1|  1|name1|                wb1|
|                wa2|  2|name2|                wb2|
+-------------------+---+-----+-------------------+

It is important to understand that the index column is stored as a non-nested (flattened) column. Because Parquet reserves . (dot) in field names for nested access, the column must be renamed at index creation time and projected back under its original nested name at query time.
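The renaming in both directions can be sketched in plain Scala (the `__` separator is taken from the examples above; the helper names are hypothetical):

```scala
// Hypothetical helpers for the proposed naming scheme. Parquet reserves
// '.' in column names for nested access, so the index stores a flattened
// name and the original dotted path is restored at query time.
object NestedFieldNames {
  private val Separator = "__"

  // "nested.nst.field1" -> "nested__nst__field1"
  def flatten(path: String): String =
    path.split('.').mkString(Separator)

  // "nested__nst__field1" -> "nested.nst.field1"
  def restore(flat: String): String =
    flat.split(Separator).mkString(".")
}
```

One caveat of this scheme: a top-level column literally named nested__nst__field1 would collide with the flattened name, so the validation step should reject or escape such names.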

Search query

Given the following search/filter query

df.filter(df("nested.nst.field1") === "wa1").select("id", "name", "nested.nst.field2")

The optimized and Spark plans without the index are

Project [id#100, name#101, nested#102.nst.field2]
+- Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))
   +- Relation[id#100,name#101,nested#102] parquet
Project [id#100, name#101, nested#102.nst.field2]
+- Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))
   +- FileScan parquet [id#100,name#101,nested#102] Batched: false, Format: Parquet, 
        Location: InMemoryFileIndex[file:/..../tableN2], PartitionFilters: [], 
        PushedFilters: [IsNotNull(nested)], ReadSchema:
        struct<id:int,name:string,nested:struct<field1:string,nst:struct<field1:string,field2:string>>>

The transformed optimized and Spark plans should look like

Project [id#1, name#2, nested__nst__field2#3]
+- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
   +- Relation[nested__nst__field1#0,id#1,name#2,nested__nst__field2#3]
        Hyperspace(Type: CI, Name: idx_nested, LogVersion: 1)
Project [id#1, name#2, nested__nst__field2#3]
+- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
   +- FileScan parquet [nested__nst__field1#0,id#1,name#2,nested__nst__field2#3] Batched: false, 
        Format: Parquet, Location: InMemoryFileIndex[file:/..../spark_warehouse/indexes/idx_nested/v__=0],
        PartitionFilters: [], PushedFilters: [IsNotNull(nested__nst__field1)], ReadSchema: 
        struct<nested__nst__field1:string,id:int,name:string,nested__nst__field2:string>

Complexities

Transforming the plan

Filters inside the plan must be modified to match the index schema rather than the data schema: the flattened schema, not the nested fields. Instead of accessing a field via GetStructField(GetStructField(AttributeReference)), the plan must access it directly through an AttributeReference.
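The rewrite can be illustrated with a self-contained sketch that mimics the shape of the expressions involved (toy case classes, not the actual Catalyst ones):

```scala
// Toy stand-ins for the Catalyst expression nodes named above.
sealed trait Expr
final case class AttributeReference(name: String) extends Expr
final case class GetStructField(child: Expr, field: String) extends Expr

// Collapse a chain of struct-field accesses rooted at an attribute into a
// single reference to the flattened index column, e.g.
//   GetStructField(GetStructField(AttributeReference("nested"), "nst"), "field1")
// becomes AttributeReference("nested__nst__field1").
def flattenAccess(e: Expr): Expr = e match {
  case GetStructField(child, field) =>
    flattenAccess(child) match {
      case AttributeReference(name) => AttributeReference(s"${name}__$field")
      case other                    => GetStructField(other, field)
    }
  case other => other
}
```

In the real implementation this transformation would run over the Catalyst plan, and it must also fix up the isnotnull check, which guards the flattened column rather than the struct.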

Given the query plan

Project [id#100, name#101, nested#102.nst.field2]
+- Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))
   +- Relation[id#100,name#101,nested#102] parquet

The filter must be modified from

Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))

to

Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))

The projection from

Project [id#100, name#101, nested#102.nst.field2]

to

Project [id#1, name#2, nested__nst__field2#3]

The relation from

Relation[id#100,name#101,nested#102] parquet

to

Relation[nested__nst__field1#0,id#1,name#2,nested__nst__field2#3] Hyperspace(Type: CI, 
  Name: idx_nested, LogVersion: 1)

Hybrid scans

The transformed plans for Hybrid Scans must accommodate a union between the latest files appended to the dataset, which have the following schema

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- nested: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- nst: struct (nullable = true)
 |    |    |-- field1: string (nullable = true)
 |    |    |-- field2: string (nullable = true)

and the index schema

root
 |-- nested__nst__field1: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- nested__nst__field2: string (nullable = true)
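To make the two sides of the union line up, the appended (still nested) source rows have to be projected into the index's flattened schema. A hypothetical helper that derives the needed projection expressions from the index column names (again assuming the `__` separator from the examples above):

```scala
// For each flattened index column, build a "nested.path AS flat__name"
// selectExpr-style string; plain top-level columns pass through unchanged.
def unionProjection(indexColumns: Seq[String]): Seq[String] =
  indexColumns.map { flat =>
    val nested = flat.split("__").mkString(".")
    if (nested == flat) flat else s"$nested AS $flat"
  }

// unionProjection(Seq("nested__nst__field1", "id", "name", "nested__nst__field2"))
// yields Seq("nested.nst.field1 AS nested__nst__field1", "id", "name",
//            "nested.nst.field2 AS nested__nst__field2")
```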

Search queries

Files Appended

The optimized plan

Union
:- Project [id#1, name#2, nested__nst__field2#3]
:  +- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
:     +- Relation[nested__nst__field1#0,id#1,name#2,nested__nst__field2#3] 
:        Hyperspace(Type: CI, Name: idx_nested, LogVersion: 1)
+- Project [id#100, name#101, nested#102.nst.field2]
   +- Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))
      +- Relation[id#100,name#101,nested#102] parquet

The Spark plan

Union
:- Project [id#1, name#2, nested__nst__field2#3]
:  +- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
:     +- FileScan parquet [nested__nst__field1#0,id#1,name#2,nested__nst__field2#3] Batched: false, 
:          Format: Parquet, Location: InMemoryFileIndex[file:/..../spark_warehouse/indexes/idx_nested/v__=0], 
:          PartitionFilters: [], PushedFilters: [IsNotNull(nested__nst__field1)], ReadSchema: 
:          struct<nested__nst__field1:string,id:int,name:string,nested__nst__field2:string>
+- Project [id#100, name#101, nested#102.nst.field2]
   +- Filter (isnotnull(nested#102) && (nested#102.nst.field1 = wa1))
      +- FileScan parquet [id#100,name#101,nested#102] Batched: false, Format: Parquet, 
           Location: InMemoryFileIndex[file:/..../tableN2/appended_file.parquet], 
           PartitionFilters: [], PushedFilters: [IsNotNull(nested)], ReadSchema: 
           struct<id:int,name:string,nested:struct<field1:string,nst:struct<field1:string,field2:string>>>

Files Deleted

The optimized plan

Project [id#1, name#2, nested__nst__field2#3]
+- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
   +- Project [nested__nst__field1#0, id#1, name#2, nested__nst__field2#3]
      +- Filter NOT (_data_file_id#368L = 3)
         +- Relation[nested__nst__field1#0,id#1,name#2,nested__nst__field2#3,_data_file_id#368L] 
              Hyperspace(Type: CI, Name: indexWithLineage, LogVersion: 1)

The Spark plan

Project [id#1, name#2, nested__nst__field2#3]
+- Filter (isnotnull(nested__nst__field1#0) && (nested__nst__field1#0 = wa1))
   +- Project [nested__nst__field1#0, id#1, name#2, nested__nst__field2#3]
      +- Filter NOT (_data_file_id#368L = 3)
         +- FileScan parquet [nested__nst__field1#0,id#1,name#2,nested__nst__field2#3] Batched: false, 
              Format: Parquet, Location: InMemoryFileIndex[file:/..../spark_warehouse/indexes/idx_nested/v__=0], 
              PartitionFilters: [], PushedFilters: [IsNotNull(nested__nst__field1)], ReadSchema: 
              struct<nested__nst__field1:string,id:int,name:string,nested__nst__field2:string>

Join queries

The following join queries use a dataset slightly different from the one at the beginning. The plans below are extracted from the HybridScanForNestedFieldsTest tests.

root
 |-- Date: string (nullable = true)
 |-- RGUID: string (nullable = true)
 |-- Query: string (nullable = true)
 |-- imprs: integer (nullable = true)
 |-- clicks: integer (nullable = true)
 |-- nested: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- leaf: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- cnt: integer (nullable = true)

Join append only

Original plan

Project [cnt#556, query#533, id#557, Date#543, id#563]
+- Join Inner, (cnt#556 = cnt#562)
   :- Project [nested#536.leaf.cnt AS cnt#556, query#533, nested#536.leaf.id AS id#557]
   :  +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && (nested#536.leaf.cnt <= 40))
   :     +- Relation[Date#531,RGUID#532,Query#533,imprs#534,clicks#535,nested#536] parquet
   +- Project [nested#548.leaf.cnt AS cnt#562, Date#543, nested#548.leaf.id AS id#563]
      +- Filter ((isnotnull(nested#548) && (nested#548.leaf.cnt <= 40)) && (nested#548.leaf.cnt >= 20))
         +- Relation[Date#543,RGUID#544,Query#545,imprs#546,clicks#547,nested#548] parquet

Altered optimized plan

Project [cnt#653, query#533, id#654, Date#543, id#660]
+- Join Inner, (cnt#653 = cnt#659)
   :- BucketUnion 200 buckets, bucket columns: [cnt]
   :  :- Project [nested__leaf__cnt#0 AS cnt#653, query#1 AS query#533, nested__leaf__id#2 AS id#654]
   :  :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 >= 20)) && 
   :  :     :      (nested__leaf__cnt#0 <= 40))
   :  :     +- Relation[nested__leaf__cnt#0,Query#1,nested__leaf__id#2] 
   :  :          Hyperspace(Type: CI, Name: index_Append, LogVersion: 1)
   :  +- RepartitionByExpression [cnt#653], 200
   :     +- Project [nested#536.leaf.cnt AS cnt#653, query#533, nested#536.leaf.id AS id#654]
   :        +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && 
   :           :      (nested#536.leaf.cnt <= 40))
   :           +- Relation[Query#533,nested#536] parquet
   +- BucketUnion 200 buckets, bucket columns: [cnt]
      :- Project [nested__leaf__cnt#0 AS cnt#659, Date#1 AS Date#543, nested__leaf__id#2 AS id#660]
      :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 <= 40)) && 
      :     :      (nested__leaf__cnt#0 >= 20))
      :     +- Relation[nested__leaf__cnt#0,Date#1,nested__leaf__id#2] 
      :          Hyperspace(Type: CI, Name: indexType2_Append, LogVersion: 1)
      +- RepartitionByExpression [cnt#659], 200
         +- Project [nested#548.leaf.cnt AS cnt#659, Date#543, nested#548.leaf.id AS id#660]
            +- Filter ((isnotnull(nested#548) && (nested#548.leaf.cnt <= 40)) && 
               :      (nested#548.leaf.cnt >= 20))
               +- Relation[Date#543,nested#548] parquet

Altered Spark plan

Project [cnt#653, query#533, id#654, Date#543, id#660]
+- SortMergeJoin [cnt#653], [cnt#659], Inner
   :- BucketUnion 200 buckets, bucket columns: [cnt]
   :  :- Project [nested__leaf__cnt#0 AS cnt#653, query#1 AS query#533, nested__leaf__id#2 AS id#654]
   :  :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 >= 20)) && 
   :  :     :      (nested__leaf__cnt#0 <= 40))
   :  :     +- FileScan Hyperspace(Type: CI, Name: index_Append, LogVersion: 1) 
   :  :          [nested__leaf__cnt#0,Query#1,nested__leaf__id#2] 
   :  :          Batched: true, Format: Parquet, 
   :  :          Location: InMemoryFileIndex[file:/.../spark_warehouse/indexes/...],
   :  :          PartitionFilters: [], PushedFilters: [IsNotNull(nested__leaf__cnt), 
   :  :          GreaterThanOrEqual(nested__leaf__cnt,20), LessThanOrEqual(nested__leaf__cnt,40), 
   :  :          ReadSchema: struct<nested__leaf__cnt:int,Query:string,nested__leaf__id:string>, 
   :  :          SelectedBucketsCount: 200 out of 200
   :  +- Exchange hashpartitioning(cnt#653, 200)
   :     +- Project [nested#536.leaf.cnt AS cnt#653, query#533, nested#536.leaf.id AS id#654]
   :        +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && 
   :           :      (nested#536.leaf.cnt <= 40))
   :           +- FileScan parquet [Query#533,nested#536] Batched: false, Format: Parquet, 
   :                Location: InMemoryFileIndex[file:/.../..., PartitionFilters: [], 
   :                PushedFilters: [IsNotNull(nested)], 
   :                ReadSchema: struct<Query:string,nested:struct<id:string,leaf:struct<id:string,cnt:int>>>
   +- BucketUnion 200 buckets, bucket columns: [cnt]
      :- Project [nested__leaf__cnt#0 AS cnt#659, Date#1 AS Date#543, nested__leaf__id#2 AS id#660]
      :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 <= 40)) && 
      :     :      (nested__leaf__cnt#0 >= 20))
      :     +- FileScan Hyperspace(Type: CI, Name: indexType2_Append, LogVersion: 1) 
      :          [nested__leaf__cnt#0,Date#1,nested__leaf__id#2]
      :          Batched: true, Format: Parquet, 
      :          Location: InMemoryFileIndex[file://.../spark_warehouse/indexes/...], 
      :          PartitionFilters: [], PushedFilters: [IsNotNull(nested__leaf__cnt), 
      :          LessThanOrEqual(nested__leaf__cnt,40), GreaterThanOrEqual(nested__leaf__cnt,20), 
      :          ReadSchema: struct<nested__leaf__cnt:int,Date:string,nested__leaf__id:string>, 
      :          SelectedBucketsCount: 200 out of 200
      +- Exchange hashpartitioning(cnt#659, 200)
         +- Project [nested#548.leaf.cnt AS cnt#659, Date#543, nested#548.leaf.id AS id#660]
            +- Filter ((isnotnull(nested#548) && (nested#548.leaf.cnt <= 40)) && 
               :      (nested#548.leaf.cnt >= 20))
               +- FileScan parquet [Date#543,nested#548] Batched: false, Format: Parquet, 
                    Location: InMemoryFileIndex[file:/.../..., PartitionFilters: [], 
                    PushedFilters: [IsNotNull(nested)], 
                    ReadSchema: struct<Date:string,nested:struct<id:string,leaf:struct<id:string,cnt:int>>>

Delete files

Original plan

Project [cnt#556, query#533, Date#543]
+- Join Inner, (cnt#556 = cnt#560)
   :- Project [nested#536.leaf.cnt AS cnt#556, query#533]
   :  +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && 
   :     :      (nested#536.leaf.cnt <= 40))
   :     +- Relation[Date#531,RGUID#532,Query#533,imprs#534,clicks#535,nested#536] parquet
   +- Project [nested#548.leaf.cnt AS cnt#560, Date#543]
      +- Filter ((isnotnull(nested#548) && (nested#548.leaf.cnt <= 40)) && 
         :      (nested#548.leaf.cnt >= 20))
         +- Relation[Date#543,RGUID#544,Query#545,imprs#546,clicks#547,nested#548] parquet

Altered optimized plan

Project [cnt#605, query#533, Date#543]
+- Join Inner, (cnt#605 = cnt#609)
   :- Project [nested__leaf__cnt#0 AS cnt#605, query#1 AS query#533]
   :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 >= 20)) && 
   :     :      (nested__leaf__cnt#0 <= 40))
   :     +- Project [nested__leaf__cnt#0, Query#1, nested__leaf__id#2]
   :        +- Filter NOT _data_file_id#615L IN (2,3)
   :           +- Relation[nested__leaf__cnt#0,Query#1,nested__leaf__id#2,_data_file_id#615L] 
   :                Hyperspace(Type: CI, Name: index_Delete, LogVersion: 1)
   +- Project [nested__leaf__cnt#0 AS cnt#609, Date#1 AS Date#543]
      +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 <= 40)) && 
         :      (nested__leaf__cnt#0 >= 20))
         +- Project [nested__leaf__cnt#0, Date#1, nested__leaf__id#2]
            +- Filter NOT _data_file_id#616L IN (2,3)
               +- Relation[nested__leaf__cnt#0,Date#1,nested__leaf__id#2,_data_file_id#616L] 
                    Hyperspace(Type: CI, Name: indexType2_Delete2, LogVersion: 1)

Altered Spark plan

Project [cnt#605, query#533, Date#543]
+- SortMergeJoin [cnt#605], [cnt#609], Inner
   :- Project [nested__leaf__cnt#0 AS cnt#605, query#1 AS query#533]
   :  +- Filter (((NOT _data_file_id#615L IN (2,3) && isnotnull(nested__leaf__cnt#0)) && 
   :     :      (nested__leaf__cnt#0 >= 20)) && (nested__leaf__cnt#0 <= 40))
   :     +- FileScan Hyperspace(Type: CI, Name: index_Delete, LogVersion: 1) 
   :          [nested__leaf__cnt#0,Query#1,_data_file_id#615L] 
   :          Batched: true, Format: Parquet, 
   :          Location: InMemoryFileIndex[file:/.../spark_warehouse/indexes/...], 
   :          PartitionFilters: [], PushedFilters: [Not(In(_data_file_id, [2,3])), 
   :          IsNotNull(nested__leaf__cnt), GreaterThanOrEqual(nested__leaf__cnt,20),  
   :          ReadSchema: struct<nested__leaf__cnt:int,Query:string,_data_file_id:bigint>, 
   :          SelectedBucketsCount: 200 out of 200
   +- Project [nested__leaf__cnt#0 AS cnt#609, Date#1 AS Date#543]
      +- Filter (((NOT _data_file_id#616L IN (2,3) && isnotnull(nested__leaf__cnt#0)) && 
         :      (nested__leaf__cnt#0 <= 40)) && (nested__leaf__cnt#0 >= 20))
         +- FileScan Hyperspace(Type: CI, Name: indexType2_Delete2, LogVersion: 1) 
              [nested__leaf__cnt#0,Date#1,_data_file_id#616L] 
              Batched: true, Format: Parquet, 
              Location: InMemoryFileIndex[file:/.../spark_warehouse/indexes/...], 
              PartitionFilters: [], PushedFilters: [Not(In(_data_file_id, [2,3])), 
              IsNotNull(nested__leaf__cnt), LessThanOrEqual(nested__leaf__cnt,40), 
              ReadSchema: struct<nested__leaf__cnt:int,Date:string,_data_file_id:bigint>, 
              SelectedBucketsCount: 200 out of 200

Append + Delete

Original plan

Project [cnt#556, query#533, Date#543]
+- Join Inner, (cnt#556 = cnt#560)
   :- Project [nested#536.leaf.cnt AS cnt#556, query#533]
   :  +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && (nested#536.leaf.cnt <= 40))
   :     +- Relation[Date#531,RGUID#532,Query#533,imprs#534,clicks#535,nested#536] parquet
   +- Project [nested#548.leaf.cnt AS cnt#560, Date#543]
      +- Filter ((isnotnull(nested#548) && (nested#548.leaf.cnt <= 40)) && (nested#548.leaf.cnt >= 20))
         +- Relation[Date#543,RGUID#544,Query#545,imprs#546,clicks#547,nested#548] parquet

Altered optimized plan

Project [cnt#617, query#533, Date#543]
+- Join Inner, (cnt#617 = cnt#621)
   :- BucketUnion 200 buckets, bucket columns: [cnt]
   :  :- Project [nested__leaf__cnt#0 AS cnt#617, query#1 AS query#533]
   :  :  +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 >= 20)) && 
   :  :     :      (nested__leaf__cnt#0 <= 40))
   :  :     +- Project [nested__leaf__cnt#0, Query#1, nested__leaf__id#2]
   :  :        +- Filter NOT (_data_file_id#627L = 3)
   :  :           +- Relation[nested__leaf__cnt#0,Query#1,nested__leaf__id#2,_data_file_id#627L] 
   :  :                Hyperspace(Type: CI, Name: index_Both, LogVersion: 1)
   :  +- RepartitionByExpression [cnt#617], 200
   :     +- Project [nested#536.leaf.cnt AS cnt#617, query#533]
   :        +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && 
   :           :      (nested#536.leaf.cnt <= 40))
   :           +- Relation[Query#533,nested#536] parquet
   +- Project [nested__leaf__cnt#0 AS cnt#621, Date#1 AS Date#543]
      +- Filter ((isnotnull(nested__leaf__cnt#0) && (nested__leaf__cnt#0 <= 40)) && 
         :      (nested__leaf__cnt#0 >= 20))
         +- Project [nested__leaf__cnt#0, Date#1, nested__leaf__id#2]
            +- Filter NOT _data_file_id#628L INSET (2,3)
               +- Relation[nested__leaf__cnt#0,Date#1,nested__leaf__id#2,_data_file_id#628L] 
                    Hyperspace(Type: CI, Name: indexType2_Delete, LogVersion: 1)

Altered Spark plan

Project [cnt#617, query#533, Date#543]
+- SortMergeJoin [cnt#617], [cnt#621], Inner
   :- BucketUnion 200 buckets, bucket columns: [cnt]
   :  :- Project [nested__leaf__cnt#0 AS cnt#617, query#1 AS query#533]
   :  :  +- Filter (((NOT (_data_file_id#627L = 3) && isnotnull(nested__leaf__cnt#0)) &&
   :  :     :      (nested__leaf__cnt#0 >= 20)) && (nested__leaf__cnt#0 <= 40))
   :  :     +- FileScan Hyperspace(Type: CI, Name: index_Both, LogVersion: 1) 
   :  :          [nested__leaf__cnt#0,Query#1,_data_file_id#627L] 
   :  :          Batched: true, Format: Parquet, 
   :  :          Location: InMemoryFileIndex[file:/.../spark_warehouse/indexes/...], 
   :  :          PartitionFilters: [], PushedFilters: [Not(EqualTo(_data_file_id,3)), 
   :  :          IsNotNull(nested__leaf__cnt), GreaterThanOrEqual(nested__leaf__cnt,20), 
   :  :          ReadSchema: struct<nested__leaf__cnt:int,Query:string,_data_file_id:bigint>, 
   :  :          SelectedBucketsCount: 200 out of 200
   :  +- Exchange hashpartitioning(cnt#617, 200)
   :     +- Project [nested#536.leaf.cnt AS cnt#617, query#533]
   :        +- Filter ((isnotnull(nested#536) && (nested#536.leaf.cnt >= 20)) && 
   :           :      (nested#536.leaf.cnt <= 40))
   :           +- FileScan parquet [Query#533,nested#536] Batched: false, Format: Parquet, 
   :                Location: InMemoryFileIndex[file:/.../...], 
   :                PartitionFilters: [], PushedFilters: [IsNotNull(nested)], 
   :                ReadSchema: struct<Query:string,nested:struct<id:string,leaf:struct<id:string,cnt:int>>>
   +- Project [nested__leaf__cnt#0 AS cnt#621, Date#1 AS Date#543]
      +- Filter (((NOT _data_file_id#628L INSET (2,3) && isnotnull(nested__leaf__cnt#0)) && 
         :      (nested__leaf__cnt#0 <= 40)) && (nested__leaf__cnt#0 >= 20))
         +- FileScan Hyperspace(Type: CI, Name: indexType2_Delete, LogVersion: 1) 
              [nested__leaf__cnt#0,Date#1,_data_file_id#628L] 
              Batched: true, Format: Parquet, 
              Location: InMemoryFileIndex[file:/.../spark_warehouse/indexes/..., 
              PartitionFilters: [], PushedFilters: [Not(In(_data_file_id, [2,3])), 
              IsNotNull(nested__leaf__cnt), LessThanOrEqual(nested__leaf__cnt,40),
              ReadSchema: struct<nested__leaf__cnt:int,Date:string,_data_file_id:bigint>, 
              SelectedBucketsCount: 200 out of 200

Impact

There should be no impact on the current Hyperspace implementations.

Background

This proposal is the result of the discussion in #312.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 2
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
sezruby commented, Feb 26, 2021

Matching based on the field name might cause confusion/problems if the same column names appear on both the left and right sides. So I think we should keep trying to use the id.

1 reaction
andrei-ionescu commented, Feb 2, 2021

At first, it was intentional to project the field that we search by (the indexed field), but I switched to your suggestion as it seems more widely used.

I’ll add details for joins too, as I gain more understanding of how the plans are composed.
