[Proposal] Bitmap Index File Format for V2 segment
Background and Goal
Bitmap index is very popular among OLAP systems. A bitmap index contains two parts.
- First, there is an ordered dictionary which contains all distinct values of a column and maps each value to an id. The first value is mapped to 0, second mapped to 1, and the last mapped to N-1 where N is the cardinality of the column. The ordered dictionary enables fast range search on its key.
- Second, there is a posting list which stores one bitmap for each value in the dictionary. The bitmap represents the set of rowids where a particular value exists. E.g., if the column contains 10 rows and value ‘x’ exists at rowids [1, 2, 3, 7], then the bitmap for value ‘x’ is [0 1 1 1 0 0 0 1 0 0], where bit n is 1 if row n contains value ‘x’ and 0 otherwise. The posting list supports fast retrieval of the n-th bitmap.
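As an illustration, the mapping between a rowid list and its bitmap can be sketched in Python (a plain bit list stands in for the compressed bitmap structure used later in this proposal):

```python
def build_bitmap(num_rows, rowids):
    """Return a bitmap where bit n is 1 iff row n contains the value."""
    bits = [0] * num_rows
    for rowid in rowids:
        bits[rowid] = 1
    return bits

def matching_rowids(bitmap):
    """Invert the mapping: recover the rowid list from a bitmap."""
    return [i for i, bit in enumerate(bitmap) if bit]
```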
To support bitmap index in Doris, we first design an efficient file format for it. The goals for such file format are
- page based format : we can reuse page encoding, compression, and cache utility already built for V2 segment
- separate from data files : new index can be added without modifying data files and decoding rows doesn’t require reading index (otherwise performance of point query will downgrade)
- space efficient and fast range search : dictionary and bitmaps should be compressed without hurting too much read performance
- memory efficient : search for a particular key only needs to load a few pages and enable computing directly on compressed page
- extendable : the format should be easy to extend in the future
Now we present the detailed design to fulfill above requirements.
Overall index file structure
Each segment file in Doris can have an accompanying bitmap index file. Since a segment file contains at most INT32_MAX rows, rowids range from 0 to INT32_MAX - 1. To reduce the number of files and open FDs, an index file contains the bitmap indexes of all indexed columns. The overall file structure is as follows:
<beginning of the file>
[Bitmap Index for Column 1]
[Bitmap Index for Column 2]
...
[Bitmap Index for Column n]
[Footer] // FooterPB, FooterPBSize(4), FooterCRC(4), MagicNumber(4)
<end of the file>
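For illustration, a reader can locate FooterPB by walking the fixed-size trailer backwards from the end of the file. A minimal Python sketch, assuming a hypothetical 4-byte magic value and little-endian fixed-width integers (the CRC check is elided):

```python
import struct

MAGIC = b"BMIX"  # hypothetical magic; the actual value is implementation-defined

def read_footer(data: bytes) -> bytes:
    """Parse the trailer [FooterPB][FooterPBSize(4)][FooterCRC(4)][MagicNumber(4)]
    and return the raw FooterPB bytes."""
    if data[-4:] != MAGIC:
        raise ValueError("not a bitmap index file")
    footer_size, _crc = struct.unpack("<II", data[-12:-4])
    return data[-12 - footer_size:-12]
```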
To add a new bitmap index to a table, there are several approaches to consider.
The first approach (shown below) is to copy existing indexes to a new file and then append the new index and footer to it. It’s easy to implement but incurs additional I/O and invalidates the cache for existing index pages.
<beginning of the file>
[Bitmap Index for Column 1] // copied
[Bitmap Index for Column 2] // copied
...
[Bitmap Index for Column n] // copied
[Bitmap Index for Column n+1] // added
[Footer] // added
<end of the file>
The second approach (shown below) is to hard-link the old index file and append the new index and then the footer to it. It requires less I/O but is more complicated to implement, and it still has the cache invalidation problem.
<beginning of the file>
[Bitmap Index for Column 1]
[Bitmap Index for Column 2]
...
[Bitmap Index for Column n]
[Footer] // old footer (ignored)
[Bitmap Index for Column n+1] // added
[Footer] // new footer added
<end of the file>
More approaches can be explored and discussed. For the purpose of this proposal, let’s move on to the layout of a bitmap index (see below).
// dictionary section
[DictDataPage 1]
...
[DictDataPage N]
[DictIndexPage] // used for locating DictDataPage for a particular key
// bitmap section
[BitmapDataPage 1]
...
[BitmapDataPage M]
[BitmapIndexPage] // used for locating BitmapDataPage for a particular bitmap (indexed by dict id)
Each bitmap index contains two sections. The “dictionary section” stores the ordered dictionary and is composed of several DictDataPages and one DictIndexPage. The encoding of DictDataPage and DictIndexPage depends on the type of the indexed column. The “bitmap section” stores the posting list and is composed of several BitmapDataPages and one BitmapIndexPage. The encoding of BitmapDataPage and BitmapIndexPage is the same for all column types.
A search operation usually requires two I/Os: one for the index page and one for the data page. There is an optimization, though: when there is only one data page, the index page is not needed. E.g., a bitmap index with one DictDataPage and one BitmapDataPage contains only two pages, so we don’t need to store a DictIndexPage or a BitmapIndexPage.
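The read path for one section can be sketched as follows; `read_page` and `find_in_index` are hypothetical stand-ins for the real page reader and index-page lookup:

```python
def locate_data_page(num_pages, page_pointer, read_page, find_in_index):
    """Return the pointer of the data page to read for this section.
    When the section has a single data page, `page_pointer` already points
    at it and the index-page I/O is skipped entirely."""
    if num_pages == 1:
        return page_pointer                   # one I/O total
    index_page = read_page(page_pointer)      # first I/O: load the index page
    return find_in_index(index_page)          # caller's read of this is the second I/O
```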
Encoding for bitmap section
To achieve good space and time efficiency, RoaringBitmap is used as the main bitmap structure.
BinaryPlainPage is used to encode BitmapDataPage. Each entry is a serialized RoaringBitmap. Since bitmaps are already highly compressed, we don’t need to further compress BitmapDataPage.
Bitmaps are stored sequentially according to their dict IDs. To efficiently locate the data page that contains the bitmap for a particular ID, BitmapIndexPage stores the first ID and file pointer of each BitmapDataPage. OrdinalIndexPage can be used to encode BitmapIndexPage.
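The lookup inside BitmapIndexPage amounts to a binary search for the last first-ID not greater than the requested dict ID. A minimal sketch, assuming the (first ID, page pointer) pairs are already decoded into two parallel lists:

```python
import bisect

def find_bitmap_page(first_ids, page_pointers, dict_id):
    """Return the pointer of the BitmapDataPage holding the bitmap for dict_id.
    `first_ids` is sorted; pick the last page whose first ID <= dict_id."""
    idx = bisect.bisect_right(first_ids, dict_id) - 1
    if idx < 0:
        raise KeyError("dict_id precedes the first page")
    return page_pointers[idx]
```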
Encoding for string dictionary
The string dictionary uses prefix encoding and a format similar to LevelDB’s DataBlock and IndexBlock. The difference is that we only store the key part, because the value (dict ID) in an ordered dictionary can be calculated as the ordinal of the key.
A new encoding called DeltaBinaryPage is added to encode the string DictDataPage. We can also leverage DeltaBinaryPage to encode the short key index, due to its good compression and search performance on sorted keys.
DeltaBinaryPage := Entry^EntryNum, Trailer
Entry := SharedPrefixLength(vint), UnsharedLength(vint), Byte^UnsharedLength
Trailer := RestartPointStartOffset(4)^NumRestartPoints, NumRestartPoints(4)
Note that
- In delta encoding, each element in the sorted sequence stores the length of the prefix it shares with the previous entry, followed by the unshared suffix.
- Every 16 elements, we create a restart point. The element at a restart point is stored fully (SharedPrefixLength == 0).
- Since elements in a DictDataPage are sorted, to search for an element we first binary search the restart points, then sequentially decode and search the elements after the chosen restart point.
- EntryNum is not stored because the start offset of Trailer is known when `NumRestartPoints` is read and it's exactly where the last element ends
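A minimal sketch of this prefix encoding (varint framing and the page trailer are elided; the entry layout and restart-point rule follow the grammar above):

```python
RESTART_INTERVAL = 16

def prefix_encode(sorted_keys):
    """Encode sorted byte strings as (shared_prefix_len, unshared_suffix) pairs,
    with a full copy (shared_prefix_len == 0) at every restart point."""
    entries, prev = [], b""
    for i, key in enumerate(sorted_keys):
        shared = 0
        if i % RESTART_INTERVAL != 0:           # restart points store the key fully
            while (shared < min(len(prev), len(key))
                   and prev[shared] == key[shared]):
                shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries

def prefix_decode(entries):
    """Reconstruct the original key list from the encoded entries."""
    keys, prev = [], b""
    for shared, unshared in entries:
        key = prev[:shared] + unshared
        keys.append(key)
        prev = key
    return keys
```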
We use BinaryPlainPage to encode DictIndexPage; it contains one entry for each DictDataPage. For the string type, the entry format is
DictIndexPage(string)'s entry := SplitKeyLength(vint), SplitKey, PageStartId(4), PageOffset(vlong), PageSize(vint)
It contains the split key, start ID, and location of each data page. The start ID is used to calculate the ordinal of a particular key. The split key is a string >= the last key in the corresponding data block and < the first key in the successive data block. For simplicity, the last key of the data page can be used, but a shorter key can be chosen using an optimization similar to LevelDB’s.
Encoding for integer dictionary
The integer dictionary contains a sorted list of integers, which can be efficiently encoded using frame-of-reference.
A new encoding called OrderedIntPage is added to encode the integer DictDataPage.
OrderedInt32Page := BitPackingFrame^(FrameCount-1), VarIntFrame, FrameOffset(4)^FrameCount, FrameCount(4), NumValues(4)
BitPackingFrame := FirstValue(4), Delta(BitWidth)^128
VarIntFrame := FirstValue(4), Delta(vint)^(NumValues - FrameCount * 128), Padding
Note that
- Each frame except the last contains 128 values and uses bit packing to store deltas. The last frame may contain fewer than 128 values, and it switches to storing deltas as vint if that happens.
- Frame size must be a multiple of 4. BitPackingFrame guarantees this property. VarIntFrame will insert padding to ensure it.
- Each frame can be decoded independently. The first (also minimum) value of each frame is stored. When searching for a particular value, we first use FrameOffset to binary search the first values of the frames, then decode the matching frame and search it sequentially.
- BitWidth for BitPackingFrame is not stored because it can be calculated from FrameOffset:
BitWidth[i] = (FrameOffset[i+1] - FrameOffset[i] - 4) * 8 / 128
- For 64-bit integers, the only differences are that FirstValue is 8 bytes, Delta in VarIntFrame uses vlong instead of vint, and BitWidth for BitPackingFrame ranges from 0 to 64
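To make the frame layout concrete, here is a sketch of the BitWidth calculation and a round trip through one frame. Whether deltas are taken against the previous value or the frame’s first value is an encoding detail; this sketch uses consecutive differences and is illustrative only:

```python
def frame_bit_width(frame_offsets, i, values_per_frame=128):
    """BitWidth[i] = (FrameOffset[i+1] - FrameOffset[i] - 4) * 8 / values_per_frame,
    since a bit-packing frame is 4 bytes of FirstValue plus packed deltas."""
    return (frame_offsets[i + 1] - frame_offsets[i] - 4) * 8 // values_per_frame

def encode_frame(sorted_values):
    """Split a sorted run into (first value, bit width, deltas)."""
    first = sorted_values[0]
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    bit_width = max((d.bit_length() for d in deltas), default=0)
    return first, bit_width, deltas

def decode_frame(first, deltas):
    """Rebuild the original values by prefix-summing the deltas."""
    values = [first]
    for d in deltas:
        values.append(values[-1] + d)
    return values
```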
We use BinaryPlainPage to encode the integer DictIndexPage. The entry format is
DictIndexPage(int32)'s entry := LastValue(4), PageStartId(4), PageOffset(vlong), PageSize(vint)
DictIndexPage(int64)'s entry := LastValue(8), PageStartId(4), PageOffset(vlong), PageSize(vint)
Footer
FooterPB contains encoding, compression, and index page location for each column’s bitmap index. It allows us to choose different encoding/compression schemes in the future.
enum PostingsTypePB {
    UNKNOWN_ENCODING = 0;
    ROARING_BITMAP = 1;
}

message BitmapIndexSectionPB {
    optional uint32 num_pages = 1;
    optional EncodingTypePB data_page_encoding = 2;
    optional EncodingTypePB index_page_encoding = 3;
    // points to the data page when `num_pages == 1`, otherwise to the index page
    optional PagePointerPB page_pointer = 4;
    optional uint64 data_page_size = 5;
    optional uint64 index_page_size = 6;
    optional CompressionTypePB data_page_compression = 7;
    optional CompressionTypePB index_page_compression = 8;
}

message BitmapIndexColumnPB {
    optional uint32 column_id = 1;
    optional uint32 unique_id = 2;
    optional int32 type = 3;
    optional BitmapIndexSectionPB dict = 4;
    optional BitmapIndexSectionPB postings = 5;
    optional PostingsTypePB postings_type = 6;
}

message BitmapIndexFileFooterPB {
    optional uint32 version = 1 [default = 1];
    repeated BitmapIndexColumnPB columns = 2;
}
Comments
I think it’s ok to save this information in RowsetMetaPB. RowsetMetaPB is seen as immutable; when we create a new index for a rowset, we will create a new RowsetMetaPB for the new rowset and leverage link-schema-change to create hard links to the old files. Then we can generate the index for this rowset and save this information in the new RowsetMetaPB.
If we can make it easy to change, I think it’s OK.
@imay thanks for the comments.
No we don’t. Different types of index will have different structures and metadata; I don’t think it’s a good idea to write them into a single file.
Agree. In this way the I/O cost of adding a new index is minimal, but the reader now needs to know which indexes are available for a segment and where each index is located. Where do you think is the best place for such information? Perhaps RowsetMetaPB?
Since the dictionary is per segment, it won’t be very large, so I think two levels are enough for most datasets. It’s also a lot simpler and easier to implement than a B-tree. I think we should start with the simpler approach and switch to the more complicated one only when it’s proven to be really necessary. Thanks for the link, I’ll take a look.