Support variable length Offline Dictionary Indexes for bytes, strings and maps to save on storage

What? Currently, the dictionary index for offline segments stores bytes and string values in fixed-size slots: each slot is as large as the longest value, and shorter values are padded with “0”. See org.apache.pinot.core.io.util.FixedByteValueReaderWriter. The idea is to avoid the padding and support byte arrays, strings and maps of varying length without slowing down lookups much.

Why? Fixed-size storage is good for fast lookups but very inefficient in space. For example, if the largest value in a String column is 100 bytes but the average is only 10 bytes, about 90% of the dictionary is padding. The same applies to byte[], maps, etc.
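To make the 90% figure concrete, here is a rough sketch of the arithmetic. The class and method names are hypothetical and not part of Pinot, and the estimate ignores any per-value offset that a variable-length layout would have to store.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Pinot: estimates how much of a
// fixed-width dictionary is padding, given the byte length of each value.
public final class PaddingEstimate {

    /** Bytes used when every value is padded to the length of the longest one. */
    static long fixedWidthBytes(int[] valueLengths) {
        int max = 0;
        for (int len : valueLengths) {
            max = Math.max(max, len);
        }
        return (long) max * valueLengths.length;
    }

    /** Bytes used by the values themselves, with no padding. */
    static long payloadBytes(int[] valueLengths) {
        long total = 0;
        for (int len : valueLengths) {
            total += len;
        }
        return total;
    }

    /** Fraction of the fixed-width layout that is padding. */
    static double paddingFraction(int[] valueLengths) {
        long fixed = fixedWidthBytes(valueLengths);
        return fixed == 0 ? 0.0 : 1.0 - (double) payloadBytes(valueLengths) / fixed;
    }

    public static void main(String[] args) {
        // Made-up column: one 100-byte value and ninety 9-byte values (average 10 bytes).
        int[] lengths = new int[91];
        Arrays.fill(lengths, 9);
        lengths[0] = 100;
        System.out.println("fixed-width bytes: " + fixedWidthBytes(lengths));
        System.out.println("payload bytes:     " + payloadBytes(lengths));
        System.out.println("padding fraction:  " + paddingFraction(lengths));
    }
}
```

For this made-up column the fixed-width layout needs 9,100 bytes to hold 910 bytes of actual data, which matches the roughly 90% padding described above.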

How? Currently, FixedByteValueReaderWriter writes the sorted values directly into the buffer, starting at offset “0” and at fixed lengths. So the first Int is at index “0”, the second at index “4”, and so on; no additional metadata is needed in the buffer. The idea is to store the offset of each element at the beginning of the buffer so that element sizes needn’t be fixed. To look up an element, we first read its offset and then read the actual element. This means two reads from the buffer (first the int offset, then the element), but the offset read should be fast enough that it doesn’t slow down the overall operation much.
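The offset-table idea can be sketched roughly as follows. This is an illustration under assumptions, not the layout Pinot ended up with: the class name, the leading value count, and the extra end offset after the last value are all hypothetical choices made here so that each element's length can be derived from two neighbouring offsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an offset-prefixed dictionary buffer.
// Assumed layout: [int count][int offset_0 .. offset_count][value bytes ...]
public final class VarLengthDictionarySketch {
    private final ByteBuffer buffer;
    private final int numValues;

    private VarLengthDictionarySketch(ByteBuffer buffer) {
        this.buffer = buffer;
        this.numValues = buffer.getInt(0);
    }

    /** Writes already-sorted values without padding, preceded by their offsets. */
    static VarLengthDictionarySketch write(byte[][] sortedValues) {
        int count = sortedValues.length;
        int dataStart = Integer.BYTES * (count + 2); // count field + (count + 1) offsets
        int total = dataStart;
        for (byte[] value : sortedValues) {
            total += value.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putInt(0, count);
        int offset = dataStart;
        for (int i = 0; i < count; i++) {
            buf.putInt(Integer.BYTES * (i + 1), offset); // where value i starts
            buf.position(offset);
            buf.put(sortedValues[i]);
            offset += sortedValues[i].length;
        }
        buf.putInt(Integer.BYTES * (count + 1), offset); // end of the last value
        return new VarLengthDictionarySketch(buf);
    }

    /** Looks up a value: read its start and end offsets, then copy the bytes. */
    byte[] get(int index) {
        if (index < 0 || index >= numValues) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int start = buffer.getInt(Integer.BYTES * (index + 1));
        int end = buffer.getInt(Integer.BYTES * (index + 2));
        byte[] value = new byte[end - start];
        ByteBuffer view = buffer.duplicate(); // independent position, shared content
        view.position(start);
        view.get(value);
        return value;
    }

    int sizeInBytes() {
        return buffer.capacity();
    }

    public static void main(String[] args) {
        byte[][] sorted = {
            "a".getBytes(StandardCharsets.UTF_8),
            "banana".getBytes(StandardCharsets.UTF_8),
            "kiwi".getBytes(StandardCharsets.UTF_8)
        };
        VarLengthDictionarySketch dict = write(sorted);
        System.out.println(new String(dict.get(1), StandardCharsets.UTF_8));
    }
}
```

With only three short values, the offsets cost more than the padding they replace, which is why the savings depend on how skewed the value lengths are.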

A few things to note:

  • If all values of a byte[], string or map column have the same length, this approach only adds storage overhead (the offsets) and one additional lookup, so it may not be preferable. Hence, we can have a flag/property at the column level to decide whether to use the VarLengthByteValueReaderWriter or not.
  • Backward compatibility must not be broken, which means we need to introduce some kind of header into the buffer to distinguish the on-disk storage formats.
  • We need to run benchmarks to measure the lookup overhead added by this approach.
  • If possible, we should also benchmark the storage savings of the new approach so that we can make data-driven decisions.
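One way to picture the header mentioned in the list above is a marker at the start of the buffer that a reader checks before choosing a format. The sketch below is only an assumption about how that could look: the class name and the marker value are made up, and the issue does not say what header was actually chosen.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of format detection via a leading marker.
public final class DictionaryFormatSketch {
    // Made-up marker value; the issue does not specify the real header.
    static final int VAR_LENGTH_MAGIC = 0x7641724C;

    /** Returns true if the buffer starts with the variable-length marker. */
    static boolean isVarLength(ByteBuffer buffer) {
        return buffer.remaining() >= Integer.BYTES
            && buffer.getInt(buffer.position()) == VAR_LENGTH_MAGIC;
    }

    /** Returns a new buffer holding the marker followed by the payload bytes. */
    static ByteBuffer withHeader(ByteBuffer payload) {
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + payload.remaining());
        out.putInt(VAR_LENGTH_MAGIC);
        out.put(payload.duplicate()); // duplicate so the caller's position is untouched
        out.flip();
        return out;
    }
}
```

A legacy fixed-width buffer has no header, so its first four bytes are value data and could in principle equal the marker. A real design would need to pick a marker that cannot collide with legitimate content, or record the format somewhere outside the buffer.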

Thanks @kishoreg for pointing out this problem and brainstorming.

P.S: This was originally tracked in https://github.com/winedepot/pinot/issues/24

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

kishoreg commented, Jun 20, 2019

@kishoreg once you get the feature in, if you move to use the existing Buffer class, won’t it be incompatible? Isn’t it better to just make the Buffer class public, do all the development using Buffer, and let the IDE pull out the Buffer class for free? That way you keep compatibility across check-ins.

@mcvsubbu yes that’s a possibility. Is that something you can work on? Might be easier for you to pull it out since you know the context. Once you pull it out @buchireddy can update this PR.

buchireddy commented, Jul 11, 2019

Merged this feature as part of https://github.com/apache/incubator-pinot/pull/4321 and hence closing the issue.
