Store HNSW graph connections more compactly

Description

HNSW search is most efficient when all vector data fits in the page cache, so it is good to keep the vector files as small as possible.

We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478

Maybe instead we could store the connection list using PackedInts.Writer. This would decrease the bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.
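To make the idea concrete, here is a minimal sketch of fixed-width bit packing in Python. It does not use Lucene's `PackedInts.Writer`. The helper names (`bits_required`, `pack_neighbors`, `unpack_neighbors`) and the 4-byte count prefix are assumptions made for illustration only. Every record is padded to `max_conn` slots so that all connection lists occupy the same number of bytes.

```python
def bits_required(max_value: int) -> int:
    """Bits needed to represent any value in [0, max_value]."""
    return max(1, max_value.bit_length())


def pack_neighbors(neighbors, num_nodes, max_conn):
    """Pack one connection list into a fixed-size record (hypothetical layout).

    Layout: 4-byte little-endian neighbor count, then max_conn slots of
    `bits` bits each. Unused slots are zero-padded so every record has the
    same length and node i's record starts at i * record_size.
    """
    if len(neighbors) > max_conn:
        raise ValueError("more neighbors than max_conn")
    bits = bits_required(num_nodes - 1)
    acc = 0
    for slot, node in enumerate(neighbors):
        acc |= node << (slot * bits)
    payload_bytes = (max_conn * bits + 7) // 8
    return len(neighbors).to_bytes(4, "little") + acc.to_bytes(payload_bytes, "little")


def unpack_neighbors(record, num_nodes):
    """Inverse of pack_neighbors."""
    bits = bits_required(num_nodes - 1)
    count = int.from_bytes(record[:4], "little")
    acc = int.from_bytes(record[4:], "little")
    mask = (1 << bits) - 1
    return [(acc >> (slot * bits)) & mask for slot in range(count)]
```

With 1,000,000 nodes each node ID fits in 20 bits rather than 32, so in this sketch a record with 32 slots takes 4 + 80 = 84 bytes instead of 4 + 128 = 132 bytes.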

I quickly tested the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:

Baseline
155M	luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex

Hacky patch
103M	luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex

Top GitHub Comments

benwtrent commented, Nov 17, 2022

I changed the PR to use delta encoding and vint instead. Even with the memory offsets stored in the .vex file, the storage savings are much better than with PackedInts.
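As a rough illustration of delta encoding plus vint, the following Python sketch sorts each connection list, stores the gaps between consecutive node IDs as variable-length integers (7 data bits per byte, high bit set when more bytes follow), and records a byte offset per node because records no longer have a fixed size. This is not the PR's code. The function names and the exact record layout are assumptions, and the sketch assumes neighbor order does not need to be preserved.

```python
def write_vint(value, out):
    """Append a non-negative int as a variable-length int to bytearray `out`."""
    if value < 0:
        raise ValueError("vint requires a non-negative value")
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)


def read_vint(buf, pos):
    """Read a vint from `buf` at `pos`; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def encode_graph(adjacency):
    """Encode all connection lists; return (data, offsets), one offset per node."""
    data, offsets = bytearray(), []
    for neighbors in adjacency:
        offsets.append(len(data))
        ordered = sorted(neighbors)
        write_vint(len(ordered), data)
        prev = 0
        for node in ordered:
            write_vint(node - prev, data)  # store the gap, not the absolute ID
            prev = node
    return bytes(data), offsets


def decode_neighbors(data, offset):
    """Decode the (sorted) connection list that starts at `offset`."""
    count, pos = read_vint(data, offset)
    result, prev = [], 0
    for _ in range(count):
        delta, pos = read_vint(data, pos)
        prev += delta
        result.append(prev)
    return result
```

Small gaps take a single byte each, which is where the savings come from. The trade-off is the extra offsets table needed to locate each node's record.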

Table with some numbers around the size improvements for different data sets & parameters:

packed_vex_mb_size	vex_mb_size	packed_index_build_time	index_build_time	params	dataset	percent_reduction
79.9	161.6	767	784	{'M': 16, 'efConstruction': 100}	glove-100-angular	50.55693069
108.4	464.1	1138	1225	{'M': 48, 'efConstruction': 100}	glove-100-angular	76.64296488
2.3	8.2	36	36	{'M': 16, 'efConstruction': 100}	mnist-784-euclidean	71.95121951
2.4	23.5	36	36	{'M': 48, 'efConstruction': 100}	mnist-784-euclidean	89.78723404
66.1	392.2	501	572	{'M': 48, 'efConstruction': 100}	sift-128-euclidean	83.1463539
59.7	136.6	449	516	{'M': 16, 'efConstruction': 100}	sift-128-euclidean	56.29575403
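The percent_reduction column appears to be the relative size saving of the packed file over the baseline file. A small check in Python, assuming that interpretation (the function name is made up for this example):

```python
def percent_reduction(packed_mb: float, baseline_mb: float) -> float:
    """Relative size saving of the packed file, in percent."""
    return (1.0 - packed_mb / baseline_mb) * 100.0


# First table row: glove-100-angular with M=16
glove_m16 = percent_reduction(79.9, 161.6)
```

Applied to the first row, this gives about 50.557, which matches the table.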

For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:

[Charts not preserved: QPS plots for glove-100-angular, mnist-784-euclidean, and sift-128-euclidean]

benwtrent commented, Oct 6, 2022

> Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance.

Yes, I can provide some images and numbers. When I ran glove-100-angular batched, I saw no difference. I will run more tests for sure.
