Store HNSW graph connections more compactly
Description
HNSW search is most efficient when all vector data fits in the page cache, so it is good to keep the size of vector files as small as possible.
We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
Maybe instead we could store the connection list using PackedInts.Writer. This would decrease the number of bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, so that we can continue to index into the graph data easily.
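To make the idea concrete, here is a minimal sketch of fixed-width bit packing in plain Java. It does not use Lucene's PackedInts API; the class name PackedNeighbors, its methods, and the maxConn value of 32 are all hypothetical and chosen for illustration. It also assumes the neighbor count for each node is stored elsewhere.

```java
// Hypothetical illustration, not Lucene code: packs node ids with a fixed bit width
// so that every connection list occupies the same number of bytes.
final class PackedNeighbors {

    // Bits needed to represent any node id in [0, numNodes).
    static int bitsRequired(int numNodes) {
        if (numNodes <= 1) {
            return 1;
        }
        return 32 - Integer.numberOfLeadingZeros(numNodes - 1);
    }

    // Bytes occupied by one list with room for maxConn ids of bitsPerValue bits each.
    static int bytesPerList(int maxConn, int bitsPerValue) {
        return (maxConn * bitsPerValue + 7) / 8;
    }

    // Packs the ids into a fixed-size array; unused slots stay zero.
    // Assumption: the actual neighbor count is recorded separately.
    static byte[] pack(int[] neighbors, int maxConn, int bitsPerValue) {
        byte[] out = new byte[bytesPerList(maxConn, bitsPerValue)];
        for (int i = 0; i < neighbors.length; i++) {
            int bitPos = i * bitsPerValue;
            for (int b = 0; b < bitsPerValue; b++, bitPos++) {
                if (((neighbors[i] >>> b) & 1) != 0) {
                    out[bitPos >>> 3] |= (byte) (1 << (bitPos & 7));
                }
            }
        }
        return out;
    }

    // Reads the id stored in slot `index` of a packed list.
    static int get(byte[] packed, int index, int bitsPerValue) {
        int value = 0;
        int bitPos = index * bitsPerValue;
        for (int b = 0; b < bitsPerValue; b++, bitPos++) {
            int bit = (packed[bitPos >>> 3] >>> (bitPos & 7)) & 1;
            value |= bit << b;
        }
        return value;
    }

    public static void main(String[] args) {
        int bits = bitsRequired(1_000_000);
        System.out.println("bits per connection: " + bits);
        System.out.println("bytes per list (packed): " + bytesPerList(32, bits));
        System.out.println("bytes per list (32-bit ints): " + 32 * Integer.BYTES);
    }
}
```

Because every list has the same byte length, the list for node i in this sketch would start at i * bytesPerList(maxConn, bits), which preserves the easy indexing described above.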
I quickly tested the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:
Baseline
155M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
Hacky patch
103M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex
Issue Analytics
- State:
- Created a year ago
- Reactions:3
- Comments:10 (10 by maintainers)
Top GitHub Comments
I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within vex, the storage improvements are much better than with PackedInts. Table with some numbers around the size improvements for different data sets & parameters:
For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:
GloVe
MNIST
SIFT
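For readers unfamiliar with the delta encoding & vint approach mentioned in the comment above, the following is a rough sketch in plain Java. It is not the code from the PR: the class DeltaVIntNeighbors and its methods are hypothetical, the variable-length integer layout (7 payload bits per byte, high bit as continuation flag) is a common convention assumed here, and the sketch assumes the reader does not depend on the original order of neighbors within a list, since the ids are sorted before the gaps are written.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration, not the PR's code: sorts neighbor ids, then stores the
// gaps between consecutive ids as variable-length integers.
final class DeltaVIntNeighbors {

    // Writes a non-negative int using 7 bits per byte; the high bit marks continuation.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one vint starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] data, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Encodes one list as: count, first id, then the gap to each following id.
    static byte[] encode(int[] neighbors) {
        int[] sorted = neighbors.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, sorted.length);
        int prev = 0;
        for (int id : sorted) {
            writeVInt(out, id - prev);
            prev = id;
        }
        return out.toByteArray();
    }

    // Decodes the list that starts at `offset`, returning ids in ascending order.
    static int[] decode(byte[] data, int offset) {
        int[] pos = {offset};
        int count = readVInt(data, pos);
        int[] ids = new int[count];
        int prev = 0;
        for (int i = 0; i < count; i++) {
            prev += readVInt(data, pos);
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(new int[] {100, 5, 103, 1000});
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("decoded: " + Arrays.toString(decode(encoded, 0)));
    }
}
```

Lists encoded this way no longer have a uniform byte length, so a reader needs a per-node start offset to locate each list. That is presumably why the comment mentions storing offsets alongside the graph data.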
Yes, I can provide some images and numbers. What I saw when running glove-100-angular batched is that there is no difference. Will run more tests for sure.