[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset
Feature requested
In the case of very large datasets (34 billion records), the generated index is made up of large files and query performance suffers.
Given the following details…
Query
val sql = s"""
SELECT ts.timestamp
FROM ts
WHERE ts.timestamp >= to_timestamp('2020-03-17')
AND ts.timestamp < to_timestamp('2020-03-18')
LIMIT 1000
"""
Executed with:
spark.sql(sql).collect
Dataset
- the schema has about 20 top-level fields, of which about 17 are heavily nested
- about 34 billion rows
- the timestamp field is of timestamp type with second-level precision
- the cardinality of the timestamp values is 17 145 000 out of 34 155 510 037
- the format is Iceberg
Index
hs.createIndex(
  ts,
  IndexConfig(
    "idx_ts3",
    indexedColumns = Seq("timestamp"),
    includedColumns = Seq("ns", "id")))
The index has:
- 434GB total index size
- 200 files
- 2.3GB average file size
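For illustration only, here is roughly what the requested layout could look like if the index data were rewritten with plain Spark writer APIs. This is not the Hyperspace API: the derived ts_date column, the bucket count and the output table name below are hypothetical.

// Sketch of the requested layout using plain Spark 2.4 writer APIs, not Hyperspace.
// The ts_date column, bucket count and table name are hypothetical; the read path
// is the existing index location (elided).
import org.apache.spark.sql.functions.{col, to_date}

val indexDf = spark.read.parquet("/.../spark-warehouse/indexes/idx_ts3/v__=0")

indexDf
  .withColumn("ts_date", to_date(col("timestamp"))) // coarse, day-level partition key
  .write
  .format("parquet")
  .partitionBy("ts_date")         // directory-level pruning for day-range predicates
  .bucketBy(200, "timestamp")     // keep the bucketing on the indexed column
  .sortBy("timestamp")            // narrow per-file min/max ranges for row-group skipping
  .saveAsTable("idx_ts3_by_day")  // bucketBy requires saveAsTable rather than save(path)

With a layout like this, the day-range filter in the query above would read a single ts_date=2020-03-17 directory instead of 200 multi-gigabyte files.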
Explained query
=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->
=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->
=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0
=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
| Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
| *DataSourceV2Scan| 1| 0| -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)| 0| 1| 1|
| CollectLimit| 1| 1| 0|
| Filter| 1| 1| 0|
| Project| 1| 1| 0|
| WholeStageCodegen| 1| 1| 0|
+--------------------------------------------------------+-------------------+------------------+----------+
The cluster
I ran the experiment on a Databricks cluster with the following details:
- driver: 64 cores, 432GB memory
- 6 workers: 32 cores, 256GB memory
- Spark version 2.4.5
Results
Time to get the 1000 rows:
- with Hyperspace: 17.24s
- without Hyperspace: 16.86s
Acceptance criteria
The time to get 1000 rows using Hyperspace should be at least twice as fast as without it.
Additional context
For some more context, this work was started in PR #329.
Thank you for opening this issue @andrei-ionescu!
I'm copying some context from #329 for easier readability:
From: @imback82
From: @andrei-ionescu
From: @imback82
From: @andrei-ionescu
@sezruby Thanks for confirming my current understanding of how Hyperspace works.
This is something that I want to avoid: duplicating the dataset once more (including all columns means storing the whole dataset again, just bucketed and sorted differently).
But I guess including all columns in includedColumns is needed because the query selects ts.* (aka select all).
Iceberg uses partition pruning and file skipping based on the metadata it stores, plus pushed-down predicates. This is the reason why it is faster than Hyperspace in some cases.
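For context, a minimal sketch of the kind of metadata Iceberg relies on, using the Iceberg Java API. The schema, field ids and table location below are hypothetical and much simpler than the real 20-field schema.

// Illustration of Iceberg's day-level partition transform, not the real table setup.
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.hadoop.HadoopTables
import org.apache.iceberg.types.Types

val schema = new Schema(
  Types.NestedField.required(1, "timestamp", Types.TimestampType.withZone()),
  Types.NestedField.optional(2, "ns", Types.StringType.get()),
  Types.NestedField.optional(3, "id", Types.StringType.get()))

// day(timestamp): every data file is registered under a day partition value, so a
// predicate like timestamp >= '2020-03-17' AND timestamp < '2020-03-18' is pruned
// to one day's files from table metadata alone, before any data is read.
val spec = PartitionSpec.builderFor(schema).day("timestamp").build()

new HadoopTables(spark.sparkContext.hadoopConfiguration)
  .create(schema, spec, "/tmp/iceberg/ts_by_day") // hypothetical location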
For timestamp types there is added complexity: the precision of the timestamp stored in the dataset is at the second level, while we query at day granularity. I guess it needs to go through the whole dataset to transform the column, even though the index dataset is bucketed.
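As a sketch of that point (again plain Spark, not the Hyperspace API): hash buckets are computed from the exact second-precision value, so rows for a single day are spread across all of the 200 index files and a day-range predicate cannot eliminate any of them. Range-clustering the index files by timestamp instead gives each file a narrow min/max window that the pushed-down filter can skip on. The partition count and output path below are hypothetical.

// Sketch only, not the Hyperspace API. Re-cluster the index files so each one
// covers a contiguous time range; the partition count (2000, roughly 200MB per
// file) and the output path are hypothetical.
import org.apache.spark.sql.functions.col

spark.read
  .parquet("/.../spark-warehouse/indexes/idx_ts3/v__=0")
  .repartitionByRange(2000, col("timestamp"))  // each output file covers a contiguous time range
  .sortWithinPartitions("timestamp")           // tight per-row-group min/max statistics
  .write
  .parquet("/.../indexes/idx_ts3_range_clustered")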