
Possible storage optimization for MV forward index


We recently observed a huge increase in table size after adding an MV column: the table went from 2TB to 12TB.

The column was dictionary encoded (at the time, raw MV column support either hadn’t been added yet or was still in progress). The column is needed in the WHERE clause, so it is dictionary encoded along with an inverted index. Both the forward and the inverted index contributed to the size increase. Some stats from a sample segment:

mvCol.cardinality = 131483
mvCol.totalDocs = 287714
mvCol.dictionary.size = 525940
mvCol.forward_index.size = 800860697
mvCol.inverted_index.size = 553252156
mvCol.maxNumberOfMultiValues = 16649
mvCol.totalNumberOfEntries = 336962215

The numDocs in the segment is fairly low at ~287k, and given that totalNumberOfEntries is around 336 million, the cardinality of ~131k is very low, which means the data is highly repetitive.

The forward index size of ~800MB comes mostly from the rawData section, whose size is computed as:

rawDataSize = ((long) totalNumValues * numBitsPerValue + 7) / 8;

numBitsPerValue is 18, since the largest dictId (cardinality - 1 = 131482) needs 18 bits.

So the dictId for each of the ~336 million values is encoded with 18 bits in the rawData section, which alone accounts for roughly 758MB of the ~800MB forward index.
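For concreteness, here is a small standalone sketch (not Pinot code) that plugs the sample segment’s stats into the formula above:

public class MvForwardIndexSizeEstimate {
    public static void main(String[] args) {
        long totalNumValues = 336_962_215L;  // mvCol.totalNumberOfEntries
        int cardinality = 131_483;           // mvCol.cardinality
        // Smallest bit width that can hold the largest dictId (cardinality - 1)
        int numBitsPerValue = 32 - Integer.numberOfLeadingZeros(cardinality - 1);  // 18
        long rawDataSize = (totalNumValues * numBitsPerValue + 7) / 8;
        // Prints: 18 bits per value -> 758164984 bytes (~758MB of the ~800MB forward index)
        System.out.println(numBitsPerValue + " bits per value -> " + rawDataSize + " bytes");
    }
}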

  • Haven’t thought much about a solution yet, but given that there is so much duplicate data in the above sample (and I have checked that there are repetitive runs), one potential way could be to use RLE along with bit packing, where the run length itself is bit-packed, and/or a hybrid combination of RLE and bit packing that switches between the two depending on the data (something like what Parquet does); a rough sketch of this hybrid idea follows this list.

  • Another solution would be variable-length bit encoding: instead of the current approach of using a fixed number of bits (essentially the maximum needed) for each dictId, use only as many bits as each value requires. In this case, however, a fixed 5 extra bits per dictId would be needed to indicate how many bits encode that dictId.

  • Another way could be a sort of dictionary on top of the dictionary; this can work if an entire array is duplicated. For example, if the dictId array [1, 2, 3, 4, 5, 10] appears across multiple rows/docs, we create a dictId for this dictId array and use that single dictId to encode those rows instead of repeating the array of dictIds.

  • We can also encode the forward index of the column using general-purpose compression schemes (like LZ4) while still keeping the dictionary structure. Currently, if we enable the LZ4 / SNAPPY / ZSTD codecs on a column, it must be marked as noDictionary.
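As a rough illustration of the first (RLE + bit-packing hybrid) idea, the standalone sketch below estimates how much a run-length-aware encoding could save over plain 18-bit packing on synthetic dictIds with long repetitive runs. The class name, run/literal headers, and marker-bit costs are illustrative assumptions, not Pinot’s actual format:

import java.util.Random;

// Illustrative estimate only: long runs of the same dictId are costed as RLE runs
// (marker bit + bit-packed value + run length), short runs as bit-packed literals,
// similar in spirit to Parquet's RLE/bit-packing hybrid.
public class RleBitPackingSketch {

    static int bitsNeeded(int maxValue) {
        return maxValue == 0 ? 1 : 32 - Integer.numberOfLeadingZeros(maxValue);
    }

    // Runs of at least minRunLength are costed as RLE; everything else stays bit-packed.
    static long estimateHybridBits(int[] dictIds, int bitsPerValue, int minRunLength) {
        long bits = 0;
        int i = 0;
        while (i < dictIds.length) {
            int runStart = i;
            while (i + 1 < dictIds.length && dictIds[i + 1] == dictIds[runStart]) {
                i++;
            }
            int runLength = i - runStart + 1;
            if (runLength >= minRunLength) {
                bits += 1 + bitsPerValue + 32;                     // marker + value + run length
            } else {
                bits += 1 + 32 + (long) runLength * bitsPerValue;  // marker + literal count + literals
            }
            i++;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Synthetic data loosely mimicking the reported segment: low cardinality, long repetitive runs
        int cardinality = 131_483;
        int bitsPerValue = bitsNeeded(cardinality - 1);  // 18
        Random random = new Random(42);
        int[] dictIds = new int[5_000_000];
        for (int i = 0; i < dictIds.length; ) {
            int value = random.nextInt(cardinality);
            int run = 1 + random.nextInt(200);
            for (int j = 0; j < run && i < dictIds.length; j++, i++) {
                dictIds[i] = value;
            }
        }
        long plainBits = (long) dictIds.length * bitsPerValue;
        long hybridBits = estimateHybridBits(dictIds, bitsPerValue, 8);
        System.out.printf("plain bit-packing:        %,d bytes%n", (plainBits + 7) / 8);
        System.out.printf("RLE + bit-packing hybrid: %,d bytes%n", (hybridBits + 7) / 8);
    }
}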

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

3 reactions
somandal commented, Jul 12, 2022

Hey @Jackie-Jiang @walterddr @richardstartin, we ran some compression experiments to assess the best approach for solving this problem, based on the 4th solution listed in the issue description. The results and recommendation are summarized in this document: https://docs.google.com/document/d/1BWtNKvxL1Uaydni_BJCgWN8i9_WeSdgL3Ksh4IpY_K0/edit?usp=sharing

Can you folks take a look and get back to us with your comments / feedback?

1 reaction
kishoreg commented, Dec 5, 2021

Even though the number of docs is low, the total number of entries is quite high at 336 million. There is not much we can do to reduce the size of the forward index without sacrificing access speed. One option would be to eliminate the forward index when it’s not needed post-filtering and keep only the inverted index, similar to what we do for some text index columns.

One thing that stands out is that the inverted index size (over 500MB) seems quite high for 336 million entries. Can you verify whether we run runCompress on the bitmaps? cc @richardstartin, who added something here recently.

I like the 4th idea; it would be great to be able to apply compression on any column, with or without dictionary encoding.
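For context on the run compression mentioned above: Pinot’s bitmap inverted index is built on RoaringBitmap, where run compression corresponds to runOptimize(). A minimal standalone example (assuming the org.roaringbitmap dependency is on the classpath) of how it shrinks a run-heavy posting list:

import org.roaringbitmap.RoaringBitmap;

public class RunCompressExample {
    public static void main(String[] args) {
        // A run-heavy posting list: docIds 0..99999 plus a few scattered ones
        RoaringBitmap bitmap = new RoaringBitmap();
        bitmap.add(0L, 100_000L);  // add the contiguous range [0, 100000)
        bitmap.add(250_000);
        bitmap.add(300_000);

        System.out.println("before runOptimize: " + bitmap.serializedSizeInBytes() + " bytes");
        boolean converted = bitmap.runOptimize();  // switch to run containers where they are smaller
        System.out.println("converted=" + converted
            + ", after runOptimize: " + bitmap.serializedSizeInBytes() + " bytes");
    }
}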
