[PROPOSAL]: Changes to support for multiple index types
Problem Statement
Hyperspace needs code changes to support multiple index types, such as a bloom filter index and a partition elimination index.
Background and Motivation
Why limit Hyperspace to covering indexes? Let’s make it flexible enough to support more index types.
Proposed Solution
Make changes to the existing design to allow for flexibility in adding more index types.
Design
Changes to Action classes
Applying Rules: Updated
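- Have a single Hyperspace rule that is added to the Spark optimizer. It is composed of pluggable internal rules.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object HyperspaceRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    // Generate the candidate plans, rank them, and take the best one.
    // Fall back to the original plan if no internal rule produced a candidate.
    new Ranker().rank(new Selection().select(plan)).headOption.getOrElse(plan)
  }
}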
- Have multiple internal rules, each of which works on a specialized index type, e.g. CIJoinRule, CIFilterRule, BFFilterRule, PEFilterRule.
- Generate candidate plans by applying all rules independently to the current plan.
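class Selection {
  val rules: Seq[HyperspaceInternalRule] = JoinRule :: FilterRule :: Nil

  // Each internal rule is expected to return zero or more rewritten plans,
  // so flatMap drops the rules that do not apply.
  def select(plan: LogicalPlan): Seq[LogicalPlan] = {
    rules.flatMap(r => r(plan))
  }
}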
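- Rank the generated plans using heuristics (currently hardcoded rules) to get a list of plans ordered by cost. The global ranker's output is then used to pick the head plan and return it.
class Ranker {
  // Returns the plans ordered from cheapest to most expensive.
  // The heuristics themselves are not specified in this proposal.
  def rank(plans: Seq[LogicalPlan]): Seq[LogicalPlan] = ???
}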
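The three pieces above can be tried end to end without Spark. The following is only a minimal sketch under stated assumptions: `Plan` is a hypothetical stand-in for `LogicalPlan`, its `cost` field is invented purely to give the ranker something to sort on, and the rule bodies are made up. Only the names `HyperspaceRule`, `Selection`, `Ranker`, `HyperspaceInternalRule`, `CIFilterRule` and `PEFilterRule` come from the proposal.

```scala
// Hypothetical stand-in for Spark's LogicalPlan so the sketch runs without Spark.
// `cost` is an invented field used only to illustrate ranking.
final case class Plan(description: String, cost: Int)

// Assumption: an internal rule returns Some(rewritten plan) when it applies, else None.
trait HyperspaceInternalRule {
  def apply(plan: Plan): Option[Plan]
}

// Toy rules: the names come from the proposal, the bodies are made up.
object CIFilterRule extends HyperspaceInternalRule {
  def apply(plan: Plan): Option[Plan] =
    if (plan.description.contains("filter"))
      Some(Plan(plan.description + " via covering index", plan.cost / 2))
    else None
}

object PEFilterRule extends HyperspaceInternalRule {
  def apply(plan: Plan): Option[Plan] =
    if (plan.description.contains("filter"))
      Some(Plan(plan.description + " via partition elimination index", plan.cost / 4))
    else None
}

class Selection(rules: Seq[HyperspaceInternalRule]) {
  // Apply every rule independently to the same input plan.
  def select(plan: Plan): Seq[Plan] = rules.flatMap(r => r(plan))
}

class Ranker {
  // Placeholder heuristic: cheaper plans first.
  def rank(plans: Seq[Plan]): Seq[Plan] = plans.sortBy(_.cost)
}

object HyperspaceRule {
  private val selection = new Selection(Seq(CIFilterRule, PEFilterRule))
  private val ranker = new Ranker

  // Pick the best candidate, or keep the input plan if no rule applied.
  def apply(plan: Plan): Plan =
    ranker.rank(selection.select(plan)).headOption.getOrElse(plan)
}
```

Because each rule sees the original plan rather than another rule's output, adding a new index type in this sketch only means adding one more entry to the rule list passed to `Selection`.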
PartitionEliminationIndex Design
Extending the new index config defined in the design doc at https://github.com/microsoft/hyperspace/issues/341, we can define the PartitionElimination non-covering index config as follows:
case class PartitionEliminationIndexConfig extends NonCoveringIndexConfig
Using PartitionElimination Index
PartitionEliminationIndex is a reverse index from the values of the indexed columns to the data files that contain those values. It could be especially useful for point lookups and range queries.
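To make the "reverse index" idea concrete, here is a small in-memory sketch. It is an illustration under assumptions, not the proposed storage format: the object name, the method names and the single integer indexed column are all hypothetical, and a real index would live in files read by Spark rather than in a `TreeMap`.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical in-memory model of a partition elimination index
// over a single integer column.
object PartitionEliminationIndexSketch {
  type FilePath = String

  // Build the reverse index: indexed value -> data files that contain it.
  // A sorted map is used so that range lookups can start at the lower bound.
  def build(files: Seq[(FilePath, Seq[Int])]): TreeMap[Int, Set[FilePath]] =
    files.foldLeft(TreeMap.empty[Int, Set[FilePath]]) { case (index, (path, values)) =>
      values.foldLeft(index) { (acc, v) =>
        acc.updated(v, acc.getOrElse(v, Set.empty[FilePath]) + path)
      }
    }

  // Point lookup: files that contain `value`.
  def filesForValue(index: TreeMap[Int, Set[FilePath]], value: Int): Set[FilePath] =
    index.getOrElse(value, Set.empty[FilePath])

  // Range query: files that contain any value in [from, to], both ends inclusive.
  def filesForRange(index: TreeMap[Int, Set[FilePath]], from: Int, to: Int): Set[FilePath] =
    index.iteratorFrom(from).takeWhile(_._1 <= to).flatMap(_._2).toSet
}
```

Any data file that does not appear in the returned set cannot contain a matching row, so a query can skip it entirely.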
Implementation
Refactoring Tasks
Tasks:
- trait: CreateIndex
- Class: CreateBFIndex: def op()
- Class: CreatePEIndex: def op()
- Class: CreateCoveringIndex: def op()
- Class: CreatePartitionEliminationIndex: def op()
- Class: Ranker : Global ranker which is hardcoded as of now
- Class: Selection
- Class: HyperspaceRule
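One possible shape for the `CreateIndex` trait and its per-index-type classes is sketched below. Everything beyond the names `CreateIndex`, `CreateCoveringIndex`, `CreatePEIndex` and `op()` is an assumption made for illustration: `IndexLogEntry` here is a simplified stand-in rather than the Hyperspace class, `run()` and the `kind` strings are invented, and `_data_file_name` is a hypothetical name for the file name column.

```scala
// Simplified, hypothetical stand-in for the metadata an index creation records.
final case class IndexLogEntry(name: String, kind: String, columns: Seq[String])

// Shared skeleton: common validation lives here once; only op() differs per index type.
trait CreateIndex {
  def indexName: String
  def indexedColumns: Seq[String]

  // Index-type-specific work. Returning an IndexLogEntry is an assumption of this sketch.
  protected def op(): IndexLogEntry

  // Hypothetical entry point: validate, then delegate to the specific op().
  final def run(): IndexLogEntry = {
    require(indexedColumns.nonEmpty, "at least one indexed column is required")
    op()
  }
}

class CreateCoveringIndex(
    val indexName: String,
    val indexedColumns: Seq[String],
    includedColumns: Seq[String])
  extends CreateIndex {
  // A covering index stores the indexed columns plus the included columns.
  protected def op(): IndexLogEntry =
    IndexLogEntry(indexName, "covering", indexedColumns ++ includedColumns)
}

class CreatePEIndex(val indexName: String, val indexedColumns: Seq[String])
  extends CreateIndex {
  // No included columns; the source file name column is added by default.
  protected def op(): IndexLogEntry =
    IndexLogEntry(indexName, "partitionElimination", indexedColumns :+ "_data_file_name")
}
```

Keeping the shared steps in the trait is one way to limit duplication across index types, since a new index type would only supply its own `op()`.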
PartitionEliminationIndex specific tasks
PartitionEliminationFilterIndexRule
- Get the source plan
- Get the candidate indexes the same way the covering index rule does, but keep only those whose type is PartitionEliminationIndex.
- Run a Spark query on the index data, applying the query's predicate on the indexed columns.
- Collect the list of data file paths that satisfy the predicate.
- Return a new logical plan which reads data from these filtered source data files.
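The steps above can be sketched on plain Scala collections. This is a simplified model and not the actual rule: `IndexRow`, `IndexEntry` and `Scan` are hypothetical stand-ins for the index data, the index metadata and the source plan, and the predicate is reduced to a function on a single integer column.

```scala
// Hypothetical stand-ins; none of these are Hyperspace or Spark types.
final case class IndexRow(value: Int, dataFile: String) // one row of index data
final case class IndexEntry(name: String, kind: String, rows: Seq[IndexRow])
final case class Scan(files: Seq[String], predicate: Int => Boolean) // simplified source plan

object PartitionEliminationFilterIndexRule {
  def apply(plan: Scan, indexes: Seq[IndexEntry]): Scan =
    // Keep only indexes whose type is PartitionEliminationIndex.
    indexes.find(_.kind == "PartitionEliminationIndex") match {
      case None => plan // no usable index: leave the plan unchanged
      case Some(index) =>
        // Evaluate the query predicate on the index data and collect the matching file paths.
        val matching = index.rows.filter(r => plan.predicate(r.value)).map(_.dataFile).toSet
        // Return a new plan that reads only the matching source files.
        plan.copy(files = plan.files.filter(matching.contains))
    }
}
```

Note that this sketch assumes the index covers every file in the plan. A real rule would also need to keep any source files the index does not know about, otherwise rows in those files would be dropped.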
Creating PEIndex
- Extend from Covering Index
- Same as creating a covering index, except that the included columns are skipped and the file name column is added by default.
Refreshing PEIndex
- Extend from Covering Index
- Same as refreshing a covering index, except that the included columns are skipped and the file name column is added by default.
Optimizing PEIndex
- Extend from Covering Index
- Same as optimizing a covering index, except that the included columns are skipped and the file name column is added by default.
Order of PRs:
Refactoring
- Updates for a single Hyperspace rule. (3d)
- Refactor IndexConfig and IndexLogEntry for the existing Covering Index. Update the APIs to reflect this change (1w)
- Refactor CreateIndex for Covering Index. (3d)
- Refactor RefreshIndex for Covering Index. (3d)
- Refactor OptimizeIndex for Covering Index. (3d)
BFIndex
- Introduce the new index type with support for: (a) CreateIndex, (b) the rules supported for this index type
- RefreshIndex
- OptimizeIndex
- any other index maintenance operation required
PEIndex
- Introduce the new index type with support for: (a) CreateIndex (2d), (b) the rules supported for this index type (2w)
- RefreshIndex (2d)
- OptimizeIndex (2d)
- any other index maintenance operation required (-)
Performance Implications (if applicable)
None
Alternate Design Options
@rapoth, @apoorvedave1: For file-skipping indexes, please have a look at XSkipper, built by IBM.
+1 to what @sezruby is saying. I’m in favor of separating out the rules per index type since the logic might be totally different. Ideally, we would have:
The important thing to consider is keeping duplication to a minimum, to the extent possible.
@apoorvedave1 @thugsatbay What are your thoughts on this?