[PROPOSAL]: Changes to support multiple index types

Problem Statement

Code changes are required to support multiple index types, such as a bloom filter index and a partition elimination index.

Background and Motivation

Why limit Hyperspace to covering indexes? Let’s expand it so that it is flexible enough to support more index types.

Proposed Solution

Make changes to the existing design to allow for flexibility in adding more index types.

Design

Changes to Action classes

Actions Class diagrams

Applying Rules: Updated

Have a single Hyperspace rule that gets added to the Spark optimizer. This rule is composed of internal pluggable rules.

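import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object HyperspaceRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    // Generate candidate plans, rank them, and fall back to the input plan if no rule applied.
    new Ranker().rank(new Selection().select(plan)).headOption.getOrElse(plan)
  }
}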

Have multiple internal rules, each working on a specialized type of index, for example CIJoinRule, CIFilterRule, BFFilterRule, and PEFilterRule.
Generate final plans by applying all rules independently to the current plan.

class Selection {
  val rules: Seq[HyperspaceInternalRule] = JoinRule :: FilterRule :: Nil
  def select(plan: LogicalPlan): Seq[LogicalPlan] = {
    rules.flatMap(r => r(plan))
  }
}

Rank the generated plans based on heuristics (currently hardcoded rules) to get a cost-ordered list of plans. Pick the head plan and return it (global ranker).

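class Ranker {
  // Orders candidate plans from cheapest to most expensive.
  // The heuristics are hardcoded for now and are not specified in this proposal.
  def rank(plans: Seq[LogicalPlan]): Seq[LogicalPlan] = ???
}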
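To make the select-then-rank flow concrete, here is a minimal, self-contained sketch that runs without Spark. Everything in it is an assumption for illustration: `Plan` stands in for `LogicalPlan`, the `cost` field and the string-matching rules are made up, and sorting by cost is only one possible ranking heuristic, not the one the proposal intends.

```scala
// Stand-in for Spark's LogicalPlan; `cost` is a hypothetical score used only in this sketch.
final case class Plan(description: String, cost: Int)

// Assumed shape of an internal rule: zero or more candidate rewrites per input plan.
trait HyperspaceInternalRule {
  def apply(plan: Plan): Seq[Plan]
}

// Toy rule: pretends a join index applies whenever the plan mentions a join.
object JoinRule extends HyperspaceInternalRule {
  def apply(plan: Plan): Seq[Plan] =
    if (plan.description.contains("join"))
      Seq(Plan(plan.description + " [join index]", plan.cost - 5))
    else Nil
}

// Toy rule: pretends a filter index applies whenever the plan mentions a filter.
object FilterRule extends HyperspaceInternalRule {
  def apply(plan: Plan): Seq[Plan] =
    if (plan.description.contains("filter"))
      Seq(Plan(plan.description + " [filter index]", plan.cost - 3))
    else Nil
}

class Selection {
  val rules: Seq[HyperspaceInternalRule] = Seq(JoinRule, FilterRule)
  // Apply every rule independently to the same input plan.
  def select(plan: Plan): Seq[Plan] = rules.flatMap(r => r(plan))
}

class Ranker {
  // Hypothetical heuristic: cheapest plan first.
  def rank(plans: Seq[Plan]): Seq[Plan] = plans.sortBy(_.cost)
}

object HyperspaceRuleSketch {
  // Return the best candidate, or the original plan if no rule produced one.
  def apply(plan: Plan): Plan =
    new Ranker().rank(new Selection().select(plan)).headOption.getOrElse(plan)
}
```

Because each rule only contributes candidates, adding a new index type in this sketch means adding one more entry to `rules`; the ranker stays the single place where candidates are compared.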

PartitionEliminationIndex Design

Extending the new index config defined in this design doc (https://github.com/microsoft/hyperspace/issues/341), we can define the PartitionElimination non-covering index as below:

case class PartitionEliminationIndexConfig extends NonCoveringIndexConfig

Using PartitionElimination Index

PartitionEliminationIndex is a reverse index that maps values of the indexed columns to the data files that contain those values. It could be especially useful for point lookups and range queries.
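As a rough illustration of the reverse-index idea, the sketch below models the index as an in-memory map from an indexed value to the set of files that contain it. The `ReverseIndex` class, the `Int` key type, and the file names are assumptions made for this example; the real index would live in index data files, not in a Scala `Map`.

```scala
// Hypothetical in-memory model of the reverse index:
// each indexed value maps to the set of data files that contain it.
final case class ReverseIndex(entries: Map[Int, Set[String]]) {

  // Point lookup: files that contain exactly this value.
  def lookup(value: Int): Set[String] =
    entries.getOrElse(value, Set.empty[String])

  // Range query: files that contain any value in [from, to], both ends inclusive.
  def range(from: Int, to: Int): Set[String] =
    entries.filter { case (v, _) => v >= from && v <= to }.values.flatten.toSet
}

object ReverseIndexExample {
  // Made-up index contents for three data files.
  val index: ReverseIndex = ReverseIndex(Map(
    1 -> Set("part-0.parquet"),
    5 -> Set("part-0.parquet", "part-1.parquet"),
    9 -> Set("part-2.parquet")
  ))
}
```

With this data, a point lookup for `5` needs only `part-0.parquet` and `part-1.parquet`, and a range query over `[1, 5]` never touches `part-2.parquet`.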

Implementation

Refactoring Tasks

Tasks:

Trait: CreateIndex
Class: CreateBFIndex: def op()
Class: CreatePEIndex: def op()
Class: CreateCoveringIndex: def op()
Class: CreatePartitionEliminationIndex: def op()
Class: Ranker : Global ranker which is hardcoded as of now
Class: Selection
Class: HyperspaceRule
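One possible reading of the task list above is a shared `CreateIndex` trait with an index-specific `op()` per index type. The sketch below is only that reading: the `indexName` member, the `run()` wrapper, and the `String` return type are assumptions added so the example is runnable, and none of them come from the Hyperspace code base.

```scala
// Hypothetical shape: shared lifecycle in the trait, index-specific work in op().
trait CreateIndex {
  def indexName: String

  // Index-specific build logic, supplied by each index type.
  def op(): String

  // Shared steps (validation here; logging and so on in a real action) wrap op().
  final def run(): String = {
    require(indexName.nonEmpty, "index name must not be empty")
    op()
  }
}

class CreateCoveringIndex(val indexName: String) extends CreateIndex {
  def op(): String = s"built covering index $indexName"
}

class CreateBFIndex(val indexName: String) extends CreateIndex {
  def op(): String = s"built bloom filter index $indexName"
}

class CreatePEIndex(val indexName: String) extends CreateIndex {
  def op(): String = s"built partition elimination index $indexName"
}
```

Keeping the common steps in one `final` method is one way to limit duplication across index types while still letting each type own its build logic.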

PartitionEliminationIndex specific tasks

PartitionEliminationFilterIndexRule

Get the source plan
Get the index similar to covering index rule. Choose only those indexes whose type is PartitionEliminationIndex
Run a Spark query on the index data, applying the query's predicates on the indexed columns.
Collect the list of data file paths that satisfy those predicates.
Return a new logical plan which reads data from these filtered source data files.
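The steps above can be sketched with plain Scala stand-ins for the plan nodes. This is a simplified model under stated assumptions: `PlanNode`, `Relation`, `Filter`, `IndexRow`, and `Index` are hypothetical types, only a single equality filter directly over a relation is handled, and the index is queried in memory rather than through a Spark job.

```scala
// Stand-ins for Spark logical plan nodes (hypothetical, for illustration only).
sealed trait PlanNode
final case class Relation(files: Seq[String]) extends PlanNode
final case class Filter(column: String, value: Int, child: PlanNode) extends PlanNode

// One row of index data: an indexed value and the source file that contains it.
final case class IndexRow(value: Int, fileName: String)

// Hypothetical index descriptor; `kind` distinguishes index types.
final case class Index(kind: String, column: String, rows: Seq[IndexRow])

object PartitionEliminationFilterIndexRule {
  def apply(plan: PlanNode, indexes: Seq[Index]): PlanNode = plan match {
    // Step 1: get the source plan (an equality filter directly over a relation).
    case Filter(column, value, Relation(files)) =>
      // Step 2: choose only partition elimination indexes on the filtered column.
      indexes.find(i => i.kind == "PartitionEliminationIndex" && i.column == column) match {
        case Some(index) =>
          // Steps 3 and 4: query the index data and collect the matching file paths.
          val matching = index.rows.filter(_.value == value).map(_.fileName).toSet
          // Step 5: return a plan that reads only the files that can contain matches.
          Filter(column, value, Relation(files.filter(matching.contains)))
        case None => plan // no usable index, leave the plan unchanged
      }
    case other => other // plan shape not supported by this sketch
  }
}
```

The filter node is kept on top of the pruned relation because the index only eliminates files; the surviving files can still contain non-matching rows.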

Creating PEIndex

Extend from Covering Index
Same as creating a covering index, except that the included columns are skipped and the file name column is added by default.
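The difference in what each index stores per source row can be shown with a small projection sketch. The `SourceRow` type and the `_file_name` column name are hypothetical choices for this example; the proposal does not say what the file name column is called.

```scala
// Hypothetical source row: column values plus the data file the row was read from.
final case class SourceRow(columns: Map[String, Any], fileName: String)

object IndexProjection {
  // Covering index: keep the indexed columns and the included columns.
  def coveringRow(row: SourceRow, indexed: Seq[String], included: Seq[String]): Map[String, Any] =
    (indexed ++ included).map(c => c -> row.columns(c)).toMap

  // Partition elimination index: keep the indexed columns only and add the file name.
  // "_file_name" is an assumed column name, not one taken from Hyperspace.
  def peRow(row: SourceRow, indexed: Seq[String]): Map[String, Any] =
    indexed.map(c => c -> row.columns(c)).toMap + ("_file_name" -> row.fileName)
}
```

For a row with columns `id`, `name`, and `price` read from `part-0.parquet`, a covering index on `id` including `name` would keep `id` and `name`, while the partition elimination index on `id` would keep `id` and the file name.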

Refreshing PEIndex

Extend from Covering Index
Same as refreshing a covering index, except that the included columns are skipped and the file name column is added by default.

Optimizing PEIndex

Extend from Covering Index
Same as optimizing a covering index, except that the included columns are skipped and the file name column is added by default.

Order of PRs:

Refactoring

Updates for a single Hyperspace rule. (3d)
Refactor IndexConfig and IndexLogEntry for the existing Covering Index. Update APIs to reflect this change. (1w)
Refactor CreateIndex for Covering Index. (3d)
Refactor RefreshIndex for Covering Index. (3d)
Refactor OptimizeIndex for Covering Index. (3d)

BFIndex

Introduce new Index type to support: a. CreateIndex b. Supported Rules for this index type
RefreshIndex
OptimizeIndex
any other index maintenance operation required

PEIndex

Introduce new Index type to support: a. CreateIndex (2d) b. Supported Rules for this index type (2w)
RefreshIndex (2d)
OptimizeIndex (2d)
any other index maintenance operation required (-)

Performance Implications (if applicable)

None

Alternate Design Options

Top GitHub Comments

andrei-ionescu commented, Apr 7, 2021

@rapoth, @apoorvedave1: For file skipping indexing, please have a look at XSkipper, built by IBM.

rapoth commented, Feb 2, 2021

+1 to what @sezruby is saying. I’m in favor of separating out the rules per index type since the logic might be totally different. Ideally, we would have:

Covering index comes with a collection of rules e.g., FilterIndexRule, JoinIndexRule and later AggIndexRule
Non-covering indexes (like fine-grained partition elimination and bloom filter index) will come with their own set of rules e.g., FilterRule to begin with. Also, note that the join optimizations through fine-grained partition elimination (index intersection) and bloom filter indexes (bloom filter gets pushed to one side of the join) are totally different and I do not see any reusability.

The important thing we should consider is ensuring the duplication is minimum to the extent possible.

@apoorvedave1 @thugsatbay What are your thoughts on this?
