
[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset


Feature requested

For very large datasets (34 billion records, in this case), the generated index consists of a small number of very large files, which degrades query performance.

Given the following details…

Query

val sql = s"""
  SELECT   ts.timestamp
  FROM     ts 
  WHERE    ts.timestamp >= to_timestamp('2020-03-17')
  AND      ts.timestamp < to_timestamp('2020-03-18')
  LIMIT    1000
"""

Executed with:

spark.sql(sql).collect

Dataset

  • the schema has about 20 top-level fields, roughly 17 of which are heavily nested
  • about 34 billion rows
  • the timestamp field is of timestamp type with second-level precision
  • the timestamp column has 17,145,000 distinct values out of 34,155,510,037 rows
  • the format is Iceberg
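
(That cardinality works out to roughly 34,155,510,037 / 17,145,000 ≈ 2,000 rows per distinct timestamp value, i.e. the "about 2K duplicates" mentioned in the comments below.)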

Index

hs.createIndex(
  ts, 
  IndexConfig(
    "idx_ts3", 
    indexedColumns = Seq("timestamp"), 
    includedColumns = Seq("ns", "id")))

The index has:

  • 434GB total index size
  • 200 files
  • 2.3GB average file size

Explained query

=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->

=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->

=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0

=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       *DataSourceV2Scan|                  1|                 0|        -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)|                  0|                 1|         1|
|                                            CollectLimit|                  1|                 1|         0|
|                                                  Filter|                  1|                 1|         0|
|                                                 Project|                  1|                 1|         0|
|                                       WholeStageCodegen|                  1|                 1|         0|
+--------------------------------------------------------+-------------------+------------------+----------+

The cluster

I ran the experiment on a Databricks cluster with the following configuration:

  • driver: 64 cores, 432GB memory
  • 6 workers: 32 cores, 256GB memory each
  • Spark version 2.4.5

Results

Time to get the 1000 rows:

  • with Hyperspace: 17.24s
  • without Hyperspace: 16.86s

Acceptance criteria

Fetching the 1000 rows with Hyperspace should be at least twice as fast as without it.

Additional context

For more context, this discussion started in PR #329.


Top GitHub Comments

1 reaction
rapoth commented, Feb 5, 2021

Thank you for opening this issue @andrei-ionescu!

I’m copying some context from #329 for easier readability

From: @imback82

Thanks @andrei-ionescu for the info.

I guess one solution to this is to allow transformation of the indexed keys (bucketing to minutes/hours instead of seconds, for example).
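
To make the suggested transformation concrete, here is a minimal sketch of the manual equivalent, reusing `hs`, `ts`, and `IndexConfig` from the snippet above and assuming the derived column is materialized before indexing. `tsHourly`, `timestamp_hour`, and `idx_ts_hourly` are hypothetical names; Hyperspace itself does not transform indexed keys today, which is exactly what the quote proposes adding.

// Hypothetical workaround: truncate the second-precision timestamps to the
// hour before indexing, so each indexed value covers many more rows.
import org.apache.spark.sql.functions.{col, date_trunc}

val tsHourly = ts.withColumn("timestamp_hour", date_trunc("hour", col("timestamp")))

hs.createIndex(
  tsHourly,
  IndexConfig(
    "idx_ts_hourly",
    indexedColumns = Seq("timestamp_hour"),
    includedColumns = Seq("ns", "id")))

Queries would then need to filter on timestamp_hour (or on both columns) for such an index to be applicable, which is why built-in key transformation would be preferable to this manual approach.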

From: @andrei-ionescu

Even if we don’t modify (transform) the timestamp to minutes/hours, it is still better than the current form. I have on average about 2K duplicates for each timestamp value in the index. Bucketing or partitioning just by the values would reduce the query time tremendously.

For example, instead of having the index dataset laid out like this:

/.../spark-warehouse/indexes/idx_ts3/v__=0/.....parquet

I would suggest the following:

/.../spark-warehouse/indexes/idx_ts3/v__=0/timestamp=2020-03-17T00:03:07/.....parquet

From: @imback82

I would suggest the following:

/.../spark-warehouse/indexes/idx_ts3/v__=0/timestamp=2020-03-17T00:03:07/.....parquet

Hive-partitioning was explored before but abandoned because we would need to create bucket files for each partition, which wasn’t scalable in our scenario.

But now that we have a specific use case, we can explore this again (probably in the form of a specialized index).
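
(To see the scale problem with the numbers above: naive Hive-partitioning by the raw second-precision values would create ~17,145,000 partition directories, and combining that with per-partition bucket files, e.g. 200 buckets per partition as in the current index, would mean billions of files.)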

From: @andrei-ionescu

@imback82 I’m proposing to add .partitionBy(resolvedIndexedColumns: _*) between write and parquet, similar to this:

  .repartition(resolvedIndexedColumns.map(df(_)): _*)
  .write
  .partitionBy(resolvedIndexedColumns: _*)  // proposed addition
  .parquet(...)

somewhere around this place: CreateActionBase.scala#L129-L139.

This could be just a flag, or better, an index config property, since for high-cardinality columns it may produce a very large number of folders/partitions.

We could go a step further, detect the cardinality, and choose the best approach automatically.
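
A rough sketch of what that config-driven write path could look like, with `partitionByIndexedColumns` as an imagined flag; none of these names exist in Hyperspace today, and this only illustrates the proposed branch around the write call in CreateActionBase:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: partition the index data by the indexed columns only
// when the (imagined) config flag asks for it, since high-cardinality keys
// would explode the directory count.
def writeIndexData(
    indexData: DataFrame,
    resolvedIndexedColumns: Seq[String],
    indexDataPath: String,
    partitionByIndexedColumns: Boolean): Unit = {
  val writer = indexData
    .repartition(resolvedIndexedColumns.map(indexData(_)): _*)
    .write

  val maybePartitioned =
    if (partitionByIndexedColumns) writer.partitionBy(resolvedIndexedColumns: _*)
    else writer

  maybePartitioned.parquet(indexDataPath)
}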

0 reactions
andrei-ionescu commented, Feb 16, 2021

@sezruby Thanks for confirming my current understanding of how Hyperspace works.

  • ts => indexed column; (ns, id) plus all other columns required for the result => included columns

This is something I want to avoid: duplicating the dataset once more (including all columns means storing the whole dataset again, just bucketed and sorted differently).

But I guess including all columns in includedColumns is needed because:

  1. In my query I used ts.* (i.e., select all)
  2. Hyperspace does NOT support clustered indexes (this is related to #354)

You can use the filter index as you tested, but I guess the performance is similar because Iceberg also handles push-down conditions and partitioning to some extent.

Iceberg uses partition pruning, file skipping based on its stored metadata, and pushed-down predicates. This is why it is faster than Hyperspace in some cases.

For timestamp types there is added complexity: the timestamps stored in the dataset have second-level precision while we query at day granularity. I guess it needs to go through the whole dataset to transform the column even though the index dataset is bucketed.
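
For comparison, a sketch of how Iceberg’s hidden day-partitioning bridges exactly this precision gap. The table name is hypothetical, and transform-based DDL like this requires an Iceberg catalog and newer Spark/Iceberg versions than the Spark 2.4.5 cluster used above:

// Hypothetical table: Iceberg derives the day partition from the
// second-precision timestamp column, so a day-range predicate prunes
// whole partitions without rewriting the query or the column.
spark.sql("""
  CREATE TABLE db.ts_by_day (
    timestamp TIMESTAMP,
    ns STRING,
    id STRING
  )
  USING iceberg
  PARTITIONED BY (days(timestamp))
""")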
