Possible storage optimization for MV forward index
We recently observed a huge increase in table size upon adding an MV (multi-value) column: the table grew from 2TB to 12TB.
The column was dictionary-encoded (at the time, raw MV column support either did not exist or was still being added). The column is needed in the WHERE clause, so it is dictionary-encoded along with an inverted index. Both the forward and inverted indexes contributed to the size increase. Some stats from a sample segment:
- mvCol.cardinality = 131483
- mvCol.totalDocs = 287714
- mvCol.dictionary.size = 525940
- mvCol.forward_index.size = 800860697
- mvCol.inverted_index.size = 553252156
- mvCol.maxNumberOfMultiValues = 16649
- mvCol.totalNumberOfEntries = 336962215
The numDocs in the segment is fairly low (~287k). Given that totalNumberOfEntries is around 336 million, the cardinality of ~131k is super low.
The forward index size of ~800MB comes mostly from the rawData section, whose size is computed as:
rawDataSize = ((long) totalNumValues * numBitsPerValue + 7) / 8;
numBitsPerValue is 18, since 18 bits are the minimum needed to represent a dictId when the cardinality is 131k (2^17 = 131072 < 131483). So the dictId for each of the ~336 million values is encoded with 18 bits in the rawData section, which by itself comes to roughly 758MB.
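For reference, a quick back-of-the-envelope check of those numbers (a standalone sketch; the variable names mirror the stats above and are not Pinot internals):

```java
public class FwdIndexSizeEstimate {
  public static void main(String[] args) {
    long totalNumValues = 336_962_215L; // mvCol.totalNumberOfEntries
    int cardinality = 131_483;          // mvCol.cardinality

    // Smallest fixed bit width that can represent any dictId in [0, cardinality)
    int numBitsPerValue = 32 - Integer.numberOfLeadingZeros(cardinality - 1); // 18

    // Same formula as above: bit-packed size of the rawData section, in bytes
    long rawDataSize = (totalNumValues * numBitsPerValue + 7) / 8;

    System.out.println(numBitsPerValue); // 18
    System.out.println(rawDataSize);     // 758164984 (~758MB of the ~800MB forward index)
  }
}
```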
I haven't thought much about a solution yet, but given that there is so much duplicate data in the above sample (and I have verified that there are repetitive runs), one potential approach is RLE combined with bit packing, where the run length itself is bit-packed, and/or a hybrid of RLE and bit-packing that switches between the two depending on the data (similar to what Parquet does).
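A minimal sketch of the idea, assuming plain RLE over the dictId stream (illustrative only, not Pinot code; a real implementation would also bit-pack the run lengths and dictIds, and fall back to bit-packing for regions with short runs, as Parquet's hybrid encoding does):

```java
import java.util.ArrayList;
import java.util.List;

public class DictIdRleSketch {

  /** Encodes the dictId stream as (runLength, dictId) pairs. */
  static int[] encode(int[] dictIds) {
    List<Integer> out = new ArrayList<>();
    int i = 0;
    while (i < dictIds.length) {
      int value = dictIds[i];
      int runLength = 1;
      while (i + runLength < dictIds.length && dictIds[i + runLength] == value) {
        runLength++;
      }
      out.add(runLength);
      out.add(value);
      i += runLength;
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
  }

  /** Expands the (runLength, dictId) pairs back into the original stream of numValues entries. */
  static int[] decode(int[] encoded, int numValues) {
    int[] out = new int[numValues];
    int pos = 0;
    for (int i = 0; i < encoded.length; i += 2) {
      for (int j = 0; j < encoded[i]; j++) {
        out[pos++] = encoded[i + 1];
      }
    }
    return out;
  }
}
```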
Another solution is variable-length bit encoding: instead of the current approach of using a fixed number of bits (essentially the maximum number of bits) for every dictId, use only as many bits as each value needs. In that case, however, a fixed 5 bits per dictId are also needed to indicate how many bits encode that dictId.
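A rough cost model for that scheme (illustrative only; assumes the 5-bit width prefix per value described above):

```java
public class VariableLengthCostModel {

  /** Total bits needed if each dictId is stored as a 5-bit width prefix plus that many value bits. */
  static long variableLengthBits(int[] dictIds) {
    long totalBits = 0;
    for (int dictId : dictIds) {
      // Bits needed for the dictId itself (at least 1, even for dictId 0)
      int valueBits = Math.max(1, 32 - Integer.numberOfLeadingZeros(dictId));
      totalBits += 5 + valueBits;
    }
    return totalBits;
  }
}
```

Whether this wins depends on the dictId distribution: values near the top of a 131k-entry dictionary still need 17-18 bits plus the 5-bit prefix (i.e., more than the current fixed 18 bits), so it only helps when small dictIds dominate.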
Another way is a dictionary on top of the dictionary. This works when the entire array is duplicated: for example, if the dictId array [1, 2, 3, 4, 5, 10] appears across many rows/docs, we create a dictId (an arrayId) for that dictId array and store that single id per doc instead of storing the whole array of dictIds.
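A minimal sketch of that array-level dictionary (illustrative only, not Pinot code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ArrayDictionarySketch {
  private final Map<List<Integer>, Integer> arrayToId = new HashMap<>();
  private final List<int[]> idToArray = new ArrayList<>();

  /** Returns the arrayId for the given per-doc dictId array, assigning a new id if unseen. */
  int encode(int[] dictIds) {
    List<Integer> key = new ArrayList<>(dictIds.length);
    for (int d : dictIds) {
      key.add(d);
    }
    return arrayToId.computeIfAbsent(key, k -> {
      idToArray.add(dictIds.clone());
      return idToArray.size() - 1;
    });
  }

  /** Returns the original dictId array for a stored arrayId. */
  int[] decode(int arrayId) {
    return idToArray.get(arrayId);
  }
}
```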
We could also encode the forward index of the column with a general-purpose compression scheme (like LZ4) while still keeping the dictionary structure. Currently, enabling the LZ4 / SNAPPY / ZSTD codecs on a column requires marking it as noDictionary.
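A minimal sketch of that approach, assuming chunk-level compression over the bit-packed dictId bytes (Deflater stands in for LZ4/Snappy/ZSTD only to keep the example JDK-only; this is not Pinot's actual writer code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

public class CompressedDictFwdIndexSketch {

  /** Compresses one chunk of bit-packed dictIds; the dictionary and inverted index are untouched. */
  static byte[] compressChunk(byte[] bitPackedDictIds) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
      deflater.write(bitPackedDictIds);
    }
    return out.toByteArray();
  }
}
```

Reads would decompress a chunk into a small buffer and then do the usual bit-unpacking, so dictionary-based filtering and the inverted index would keep working unchanged.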
Top GitHub Comments
Hey @Jackie-Jiang @walterddr @richardstartin, we ran some compression experiments to assess the best approach for this problem, based on the 4th solution in the issue description above. The results and recommendation are summarized in this document: https://docs.google.com/document/d/1BWtNKvxL1Uaydni_BJCgWN8i9_WeSdgL3Ksh4IpY_K0/edit?usp=sharing
Can you folks take a look and get back to us with your comments / feedback?
Even though the number of docs is low, the total number of entries is quite high at 336 million. There is not much we can do to reduce the size of the forward index without sacrificing access speed. One option would be to eliminate the forward index when it is not needed after filtering and keep only the inverted index, similar to what we do for some text index columns.
One thing that stands out is that the inverted index size (over 500MB) seems quite high for 336 million entries. Can you verify whether we apply run compression (runCompress) on the bitmaps? cc @richardstartin, who added something in this area recently.
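For reference, a sketch of what run compression looks like with the RoaringBitmap library (method names are from org.roaringbitmap's public API; the docId range is made up for illustration):

```java
import org.roaringbitmap.RoaringBitmap;

public class RunCompressCheck {
  public static void main(String[] args) {
    RoaringBitmap postingList = new RoaringBitmap();
    postingList.add(0L, 1_000_000L); // a single long run of docIds (hypothetical)

    System.out.println("before runOptimize: " + postingList.serializedSizeInBytes());
    postingList.runOptimize(); // converts dense containers to run-length-encoded containers
    System.out.println("after runOptimize:  " + postingList.serializedSizeInBytes());
  }
}
```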
I like the 4th idea; it would be great to be able to apply compression on any column, with or without dictionary encoding.