Store HNSW graph connections more compactly

See original GitHub issue

Description

HNSW search is most efficient when all vector data fits in the page cache, so it is worth keeping the vector files as small as possible.

We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478

Maybe instead we could store the connection list using PackedInts.Writer. This would decrease the bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.
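To make the fixed-width packing idea concrete, here is a minimal sketch in plain Java. It does not use Lucene's PackedInts API; the class and method names are hypothetical, and the figures in `main` (1,000,000 nodes, up to 32 connections per node) are assumptions chosen for illustration. Every connection list is padded to the same number of bytes, so a node's list can still be located by multiplying its ordinal by the block size.

```java
/** Hypothetical sketch of fixed-width bit packing for neighbor ids (not Lucene's PackedInts). */
class PackedConnectionsSketch {

  /** Bits needed to represent any value in [0, maxValue]. */
  static int bitsRequired(int maxValue) {
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxValue));
  }

  /** Bytes occupied by one connection list padded to maxConn entries. */
  static int bytesPerList(int maxConn, int bitsPerValue) {
    return (maxConn * bitsPerValue + 7) / 8;
  }

  /** Packs neighbor ids into a fixed-size, zero-padded block (lowest bit first). */
  static byte[] pack(int[] neighbors, int maxConn, int bitsPerValue) {
    if (neighbors.length > maxConn) {
      throw new IllegalArgumentException("too many neighbors");
    }
    byte[] block = new byte[bytesPerList(maxConn, bitsPerValue)];
    int bitPos = 0;
    for (int id : neighbors) {
      for (int b = 0; b < bitsPerValue; b++, bitPos++) {
        if (((id >>> b) & 1) != 0) {
          block[bitPos >>> 3] |= (byte) (1 << (bitPos & 7));
        }
      }
    }
    return block;
  }

  /** Reads the neighbor id stored at the given slot of a packed block. */
  static int get(byte[] block, int index, int bitsPerValue) {
    int value = 0;
    int bitPos = index * bitsPerValue;
    for (int b = 0; b < bitsPerValue; b++, bitPos++) {
      if ((block[bitPos >>> 3] & (1 << (bitPos & 7))) != 0) {
        value |= 1 << b;
      }
    }
    return value;
  }

  public static void main(String[] args) {
    int numNodes = 1_000_000; // assumed graph size
    int maxConn = 32;         // assumed maximum connections per node
    int bits = bitsRequired(numNodes - 1);
    System.out.println("fixed-size ints: " + (maxConn * Integer.BYTES) + " bytes per list");
    System.out.println("packed (" + bits + " bits each): " + bytesPerList(maxConn, bits) + " bytes per list");
  }
}
```

Under these assumed numbers each id needs 20 bits, so a padded list shrinks from 128 bytes to 80 bytes. The number of real neighbors in each list would still have to be recorded separately, which this sketch leaves out.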

I quickly tested the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:

Baseline
155M	luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex

Hacky patch
103M	luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 3
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
benwtrent commented, Nov 17, 2022

I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within vex, the storage improvements are much better than PackedInts.
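As a rough illustration of delta encoding combined with vint, the sketch below writes a neighbor list as a count followed by the gaps between consecutive ids, each stored as a variable-length integer (7 payload bits per byte, high bit set when more bytes follow). It is not the code from the PR: the class name is hypothetical, and it assumes the neighbor ids are sorted in ascending order so that every gap is non-negative.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of delta + vint encoding for one neighbor list (not the PR's code). */
class DeltaVIntSketch {

  /** Writes a non-negative int using 7 bits per byte; the high bit marks continuation. */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Reads one vint starting at pos[0] and advances pos[0] past it. */
  static int readVInt(byte[] data, int[] pos) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = data[pos[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  /** Encodes the count, then each id as the gap from the previous id. Assumes ascending order. */
  static byte[] encode(int[] sortedNeighbors) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, sortedNeighbors.length);
    int previous = 0;
    for (int id : sortedNeighbors) {
      writeVInt(out, id - previous);
      previous = id;
    }
    return out.toByteArray();
  }

  /** Reverses encode(): reads the count, then rebuilds ids by summing the gaps. */
  static int[] decode(byte[] data) {
    int[] pos = {0};
    int count = readVInt(data, pos);
    int[] neighbors = new int[count];
    int previous = 0;
    for (int i = 0; i < count; i++) {
      previous += readVInt(data, pos);
      neighbors[i] = previous;
    }
    return neighbors;
  }

  public static void main(String[] args) {
    int[] neighbors = {1000, 1003, 1010, 250000}; // made-up ids for illustration
    byte[] encoded = encode(neighbors);
    System.out.println("fixed-size ints: " + (neighbors.length * Integer.BYTES) + " bytes");
    System.out.println("delta + vint:    " + encoded.length + " bytes");
  }
}
```

Because the encoded lists vary in length, a reader can no longer find a node's list by multiplying its ordinal by a fixed size. That is why per-node offsets have to be stored alongside the graph data, as the comment above notes.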

Table with some numbers around the size improvements for different data sets & parameters:

| packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params | dataset | percent_reduction |
|---|---|---|---|---|---|---|
| 79.9 | 161.6 | 767 | 784 | {'M': 16, 'efConstruction': 100} | glove-100-angular | 50.55693069 |
| 108.4 | 464.1 | 1138 | 1225 | {'M': 48, 'efConstruction': 100} | glove-100-angular | 76.64296488 |
| 2.3 | 8.2 | 36 | 36 | {'M': 16, 'efConstruction': 100} | mnist-784-euclidean | 71.95121951 |
| 2.4 | 23.5 | 36 | 36 | {'M': 48, 'efConstruction': 100} | mnist-784-euclidean | 89.78723404 |
| 66.1 | 392.2 | 501 | 572 | {'M': 48, 'efConstruction': 100} | sift-128-euclidean | 83.1463539 |
| 59.7 | 136.6 | 449 | 516 | {'M': 16, 'efConstruction': 100} | sift-128-euclidean | 56.29575403 |
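The percent_reduction values are consistent with computing 100 × (vex_mb_size − packed_vex_mb_size) / vex_mb_size. The short sketch below reproduces that arithmetic; the class and method names are hypothetical, and the formula is inferred from the numbers rather than stated in the comment.

```java
/** Hypothetical helper that recomputes the percent_reduction column from the two size columns. */
class PercentReductionSketch {

  /** Percentage by which the packed size is smaller than the baseline size. */
  static double percentReduction(double baselineMb, double packedMb) {
    return 100.0 * (baselineMb - packedMb) / baselineMb;
  }

  public static void main(String[] args) {
    // glove-100-angular with M=16: baseline 161.6 MB, packed 79.9 MB
    System.out.printf("%.2f%%%n", percentReduction(161.6, 79.9));
    // glove-100-angular with M=48: baseline 464.1 MB, packed 108.4 MB
    System.out.printf("%.2f%%%n", percentReduction(464.1, 108.4));
  }
}
```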

For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:

[Charts not preserved: QPS for glove-100-angular, mnist-784-euclidean, and sift-128-euclidean]

1 reaction
benwtrent commented, Oct 6, 2022

> Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it’s easy to miss small differences in performance.

Yes, I can provide some images and numbers. What I saw when running glove-100-angular batched is that there is no difference. Will run more tests for sure.
