[PROPOSAL]: Changes to support for multiple index types

Problem Statement

Code changes required to support multiple index types like bloom filter index and partition elimination index.

Background and Motivation

Why limit to covering indexes? Let’s expand hyperspace to make it flexible for more index types.

Proposed Solution

Make changes to the existing design to allow for flexibility in adding more index types.

Design

Changes to Action classes

Actions Class diagrams

Applying Rules: Updated

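  1. Have a single Hyperspace rule which gets added to the Spark optimizer. It is composed of internal pluggable rules.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object HyperspaceRule extends Rule[LogicalPlan] {
  // Generate candidate plans, rank them, and return the best one.
  // Fall back to the original plan if no internal rule produced a candidate.
  def apply(plan: LogicalPlan): LogicalPlan = {
    new Ranker().rank(new Selection().select(plan)).headOption.getOrElse(plan)
  }
}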
  2. Have multiple internal rules which work on specialized types of indexes, for example CIJoinRule, CIFilterRule, BFFilterRule, PEFilterRule, etc.

  3. Generate final plans by applying all rules independently to the current plan.

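class Selection {
  // Internal rules to try; each one is applied to the same input plan.
  val rules: Seq[HyperspaceInternalRule] = JoinRule :: FilterRule :: Nil
  // Assumes each internal rule returns zero or more candidate plans, hence flatMap.
  def select(plan: LogicalPlan): Seq[LogicalPlan] = {
    rules.flatMap(r => r(plan))
  }
}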
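  4. Rank them based on heuristics (currently hardcoded rules) to get a cost-ordered list of plans, then pick the head plan and return it (global ranker).
class Ranker {
  // Returns the plans ordered from cheapest to most expensive.
  // The ranking heuristics are not specified here, so the body is left unimplemented.
  def rank(plans: Seq[LogicalPlan]): Seq[LogicalPlan] = ???
}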
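
The four steps above can be sketched end to end without Spark. This is only an illustration under stated assumptions: `Plan` is a toy stand-in for `LogicalPlan`, the two rules and their cost changes are made up, and ranking by a single `estimatedCost` field is a placeholder for the hardcoded heuristics, which the proposal does not spell out.

```scala
object HyperspaceRuleSketch {
  // Toy stand-in for Spark's LogicalPlan so the sketch runs without Spark (assumption).
  final case class Plan(description: String, estimatedCost: Int)

  // Hypothetical internal rule: returns zero or more rewritten candidate plans.
  trait InternalRule {
    def apply(plan: Plan): Seq[Plan]
  }

  // Hypothetical rule for plans that contain a filter; the cost change is made up.
  object FilterRule extends InternalRule {
    def apply(plan: Plan): Seq[Plan] =
      if (plan.description.contains("Filter"))
        Seq(Plan(plan.description + " [filter index]", plan.estimatedCost / 2))
      else Seq.empty
  }

  // Hypothetical rule for plans that contain a join; the cost change is made up.
  object JoinRule extends InternalRule {
    def apply(plan: Plan): Seq[Plan] =
      if (plan.description.contains("Join"))
        Seq(Plan(plan.description + " [join index]", plan.estimatedCost / 4))
      else Seq.empty
  }

  // Selection: apply every rule independently to the same input plan.
  def select(rules: Seq[InternalRule], plan: Plan): Seq[Plan] =
    rules.flatMap(r => r(plan))

  // Ranking: order candidates from cheapest to most expensive.
  def rank(plans: Seq[Plan]): Seq[Plan] = plans.sortBy(_.estimatedCost)

  // Top-level rule: pick the best candidate, or keep the input plan if none applies.
  def optimize(rules: Seq[InternalRule], plan: Plan): Plan =
    rank(select(rules, plan)).headOption.getOrElse(plan)
}
```

Keeping selection and ranking as separate functions mirrors the split between `Selection` and `Ranker`: new index types would only add rules, while the ranking policy could change independently.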

PartitionEliminationIndex Design

Extending the new index config defined in this design doc (https://github.com/microsoft/hyperspace/issues/341), we can define the PartitionElimination non-covering index as below:

case class PartitionEliminationIndexConfig extends NonCoveringIndexConfig

Using PartitionElimination Index

PartitionEliminationIndex is a reverse index that maps values of the indexed columns to the data files which contain those values. It could be especially useful for point lookups and range queries.
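
As a rough illustration of the reverse-index idea, the sketch below builds a value-to-files map from per-file column values and answers a point lookup and a range query. It is plain Scala rather than Spark, and the object name, the `Int` key type, and the file names are assumptions made for the example, not part of the proposal.

```scala
import scala.collection.immutable.TreeMap

object ReverseIndexSketch {
  // Build value -> files from (file -> values of the indexed column in that file).
  // A sorted map is used so range queries can be answered directly.
  def build(valuesByFile: Map[String, Seq[Int]]): TreeMap[Int, Set[String]] =
    valuesByFile.foldLeft(TreeMap.empty[Int, Set[String]]) { case (acc, (file, values)) =>
      values.foldLeft(acc) { (m, v) =>
        m.updated(v, m.getOrElse(v, Set.empty[String]) + file)
      }
    }

  // Point lookup: files that contain exactly this value.
  def pointLookup(index: TreeMap[Int, Set[String]], key: Int): Set[String] =
    index.getOrElse(key, Set.empty[String])

  // Range lookup: files that contain any value in [from, until).
  def rangeLookup(index: TreeMap[Int, Set[String]], from: Int, until: Int): Set[String] =
    index.range(from, until).values.flatten.toSet
}
```

For example, with `f1.parquet` holding values 1 and 5 and `f2.parquet` holding 5 and 9, a lookup for 5 returns both files, while the range 6 to 10 returns only `f2.parquet`.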

Implementation

Refactoring Tasks

Tasks:

  1. trait: CreateIndex
  2. Class: CreateBFIndex: def op()
  3. Class: CreatePEIndex: def op()
  4. Class: CreateCoveringIndex: def op()
  5. Class: CreatePartitionEliminationIndex: def op()
  6. Class: Ranker : Global ranker which is hardcoded as of now
  7. Class: Selection
  8. Class: HyperspaceRule
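
One possible shape for the `CreateIndex` trait and its per-index `op()` implementations is sketched below. Everything here is hypothetical: in this sketch `op()` simply returns the list of columns the index would store, the `run()` helper and the `_data_file_name` column name are invented for illustration, and only two of the listed classes are shown.

```scala
object CreateIndexSketch {
  // Hypothetical shared trait: common steps live here, index-specific work lives in op().
  trait CreateIndex {
    def indexName: String

    // Index-specific step; here it only returns the columns that would be written.
    def op(): Seq[String]

    // Shared wrapper around op(), standing in for common logging/validation.
    final def run(): String = s"$indexName: " + op().mkString(", ")
  }

  // Covering index: stores indexed columns followed by included columns.
  final class CreateCoveringIndex(
      val indexName: String,
      indexed: Seq[String],
      included: Seq[String]) extends CreateIndex {
    def op(): Seq[String] = indexed ++ included
  }

  // Partition elimination index: no included columns, plus a file name column.
  // The column name "_data_file_name" is an assumption for this sketch.
  final class CreatePEIndex(val indexName: String, indexed: Seq[String]) extends CreateIndex {
    def op(): Seq[String] = indexed :+ "_data_file_name"
  }
}
```

With this split, adding another index type would mean adding one more class that overrides `op()`, while the shared wrapper stays untouched.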

PartitionEliminationIndex specific tasks

PartitionEliminationFilterIndexRule

  1. Get the source plan
  2. Get the index similar to covering index rule. Choose only those indexes whose type is PartitionEliminationIndex
  3. Run a Spark query on the index data, applying the query's predicate on the index columns.
  4. Collect the list of data file paths whose index entries satisfy the predicate.
  5. Return a new logical plan which reads data from these filtered source data files.
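
Steps 3 to 5 can be sketched in plain Scala as follows. This is a simplified model, not the rule itself: `IndexEntry`, the `Int` key, and the function names are hypothetical, the index is an in-memory sequence rather than index data queried through Spark, and the "new logical plan" is reduced to the pruned list of source files it would read.

```scala
object PartitionEliminationSketch {
  // One index row: an indexed column value and the data file that contains it.
  final case class IndexEntry(key: Int, filePath: String)

  // Steps 3-4: evaluate the query predicate on the index and collect matching files.
  def matchingFiles(index: Seq[IndexEntry], predicate: Int => Boolean): Seq[String] =
    index.filter(e => predicate(e.key)).map(_.filePath).distinct.sorted

  // Step 5: restrict the source file list to the files found through the index.
  def pruneSourceFiles(
      sourceFiles: Seq[String],
      index: Seq[IndexEntry],
      predicate: Int => Boolean): Seq[String] = {
    val keep = matchingFiles(index, predicate).toSet
    sourceFiles.filter(keep.contains)
  }
}
```

Note that this sketch drops any source file that has no entry in the index, so a real rule would presumably need to handle files added after the index was last refreshed.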

Creating PEIndex

  1. Extend from Covering Index
  2. Same as creating a covering index, except that the included columns are skipped and the file name column is added by default.

Refreshing PEIndex

  1. Extend from Covering Index
  2. Same as refreshing a covering index, except that the included columns are skipped and the file name column is added by default.

Optimizing PEIndex

  1. Extend from Covering Index
  2. Same as optimizing a covering index, except that the included columns are skipped and the file name column is added by default.

Order of PRs:

Refactoring

  1. Updates for a single Hyperspace rule. (3d)
  2. Refactor IndexConfig and IndexLogEntry for existing Covering Index. Update APIs to reflect this change. (1w)
  3. Refactor CreateIndex for Covering Index. (3d)
  4. Refactor RefreshIndex for Covering Index. (3d)
  5. Refactor OptimizeIndex for Covering Index. (3d)

BFIndex

  1. Introduce new Index type to support:
     a. CreateIndex
     b. Supported Rules for this index type
  2. RefreshIndex
  3. OptimizeIndex
  4. any other index maintenance operation required

PEIndex

  1. Introduce new Index type to support:
     a. CreateIndex (2d)
     b. Supported Rules for this index type (2w)
  2. RefreshIndex (2d)
  3. OptimizeIndex (2d)
  4. any other index maintenance operation required (-)

Performance Implications (if applicable)

None

Alternate Design Options

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

2 reactions
andrei-ionescu commented, Apr 7, 2021

@rapoth, @apoorvedave1: For file-skipping indexing, please have a look at Xskipper, built by IBM.

1 reaction
rapoth commented, Feb 2, 2021

+1 to what @sezruby is saying. I’m in favor of separating out the rules per index type since the logic might be totally different. Ideally, we would have:

  1. Covering index comes with a collection of rules e.g., FilterIndexRule, JoinIndexRule and later AggIndexRule
  2. Non-covering indexes (like fine-grained partition elimination and bloom filter index) will come with their own set of rules e.g., FilterRule to begin with. Also, note that the join optimizations through fine-grained partition elimination (index intersection) and bloom filter indexes (bloom filter gets pushed to one side of the join) are totally different and I do not see any reusability.

The important thing to ensure is that duplication is kept to a minimum, to the extent possible.

@apoorvedave1 @thugsatbay What are your thoughts on this?
