[RFC] Lucene based kNN search support in core OpenSearch
See original GitHub issueThis document proposes integrating Lucene 9.0’s recent addition of Approximate Nearest Neighbor search into OpenSearch.
Problem Statement
k-NN Search has gained popularity over recent years with the growing use of ML. A few applications include recommendation systems, NLP services, and computer vision.
A brute force nearest neighbor search consists of calculating the distance from a query point to every point in the index and returning the “k” index points that have the smallest distance. A point can be a dense vector, a sparse vector or a binary code; however, in this document, we are only interested in dense vectors. The major drawback of this algorithm is that it is unable to search indices with millions or billions of vectors efficiently. To address this, a new problem was formulated: Approximate Nearest Neighbor search or ANN for short. The general idea is to sacrifice search quality and index latency to drastically improve search latency. Several algorithms have been created, including Hierarchical Small Worlds (HNSW), Inverted File System (IVF), and Locality Sensitive Hashing (LSH).
Several different libraries exist that implement these algorithms, including nmslib and faiss. The libraries are similar to Lucene in that they build indices that can be searched and serialized — however, they are built specifically for vector search. The k-NN plugin integrates these libraries into OpenSearch so that OpenSearch users can perform ANN search workloads.
The approach the plugin takes has a few drawbacks:
- The libraries mentioned above are implemented in C++. In a Java project like OpenSearch, this is a bit unnatural — it complicates the build process and prevents us from supporting different systems, such as Windows.
- The library indices must be loaded in totality into native memory in order to search them. The indices contain both the vectors and the underlying data structures. This causes issues with memory pressure when memory is tight.
- Developers that want to use vectors in their applications have to take a dependency on the k-NN plugin to get access to the knn_vector data type the plugin introduces.
Proposed Solution
TLDR: Build a new field type and query type in OpenSearch to support ANN search through Lucene.
Lucene 9’s introduction of Approximate Nearest Neighbor Support
In the 9.0 release, Lucene added support for building indices with dense vectors and running ANN search on them. In order to create indices with vectors, a new Codec format was introduced, KnnVectorsFormat. This format is used to encode and decode numeric vector fields and execute Nearest Neighbor search. In Lucene 9, the default format used is Lucene90HnswVectorsFormat. This format implements the HNSW algorithm for ANN search. During indexing, this format will create 2 separate segment files: one for the vectors and one for the HNSW graph structure, allowing the vectors to exist off heap.
Lucene’s implementation of HNSW takes two parameters at index time: max_connections and beam_width. Max_connections sets a ceiling on the number of connections a node in the graph can have. Beam_width is a parameter that controls the candidate list for neighbors of a new node added to the graph. These parameters map to m and ef_construction parameters from the original paper. These parameters are set in Lucene91HnswVectorsFormat’s constructor. Codec supports KnnVectorFormat at the field level, and this allows setting these parameters at the field level.
For search, a new query was introduced, the KnnVectorQuery. This query contains the query vector as well as k, the number of results a query on the HNSW graph should return. For search, the HNSW algorithm has one parameter, ef_search. In Lucene’s implementation this value is hardcoded to k.
Integration with OpenSearch
In order to integrate this new functionality into OpenSearch, we need to introduce a new field type, dense_vector. This field type would allow users to index floating point arrays of uniform length into OpenSearch and build the index structures needed to support ANN search with Lucene.
{
"type": "dense_vector",
"dtype": float,
"dimension": d,
"knn": {
"metric": "l2",
"algorithm": {
"name": "hnsw",
"params": {
"max_connections": X,
"beam_width": X
}
}
}
}
Because initially we are going to provide support for Lucene HNSW implementation format above can be simplified to a following form
{
"type": "dense_vector",
"dimension": d,
"knn": {
}
}
For provided request example system will use Lucene HNSW with L2 metric, max_connections equals 16 and beam_width equals 100.
The minimal default mapping for dense_vector does not include knn structure and is shown below. For such definition the proposed knn based on Lucene HNSW will not be enabled and related graph data structures are not created; only vector values are stored.
{
"type": "dense_vector",
"dimension": d
}
In addition to the field type, we are adding support a new query type naming knn-query.
Lucene implementation of HNSW and knn vector type has certain limitations that we have to accept with such integration - maximum number for dimensions for knn vector is 1024, this implies same limitation for new dense_vector field type.
Performance
Team has benchmarked prototype of solution based on Lucene 9.1 integration with OpenSearch 2.0.
For benchmark algorithm parameters ef_construction and beam_width set to 512, and max_connections and m varied in range from 4 to 96.
Benchmark results showed that Lucene 9.1 solution is comparable with existing k-NN hnsw implementation based on nmslib with certain tradeoffs: Lucene 9.1 solution cannot reach high recall values close to 1.0, but when recall values are comparable with existing k-NN plugin Lucene 9.1 solution consumes less resources and has better query latencies.
We have observed some trends in benchmark results - for the same high recall values memory that the k-NN index takes is higher than required for Lucene implementation. Query latencies and memory allocated per query are higher for k-NN implementation.
Solution shows itself as stable under constant query load that is typical for k-NN plugin.
Obtained metrics are rough as they are based on POC code of Lucene 9.1 integration with OpenSearch 2.0 without any optimizations. For production ready version we’re planning to have more accurate metrics.
Pre-filtering feature
Lucene implementation of HNSW gives additional benefit of having pre-filtering on queries. Current implementation in k-NN provides post-filtering - first documents are selected by applying HNSW algorithm and then we apply filter on top of HNSW results. In general results are not exactly correct due to sequence in which different processing steps are applied.
Lucene 9.1 applies filter on live-docs and then selects k closest results by applying HNSW algorithm.
At the moment we’re planning to deliver pre-filtering feature “as is”, no changes/additions to what is implemented in Lucene.
Impact on k-NN plugin
Existing knn_vector field supported by k-NN plugin will be independent of the dense_vector field introduced in core OpenSearch. This change will not break any functionality of the plugin, and customers will be able to upgrade their existing indices created in earlier k-NN versions taking into account limitation on maximum vector dimensions.
From a migration standpoint, users would be able to reindex their knn_vector indices into indices that use the new dense_vector field type and vice versa.
In order to share the knn query type with OpenSearch, the plugin’s Codec would be required to implement the new KnnVectorsFormat. More details on this will be available in the plugins repository.
Revised Approach
Overview
After some offline discussions, we have decided to propose a pivot to our plan to support Lucene k-NN.
Introducing a new data type , dense_vector, and query type, knn_query, when we already have the data type, knn_vector, and the query type, knn, in the plugin will be a source of confusion for users. Having two data types with very similar end user functionality will raise questions such as “Why aren’t they the same?”, “When should I use what?”, etc. Instead of introducing dense_vector, we are proposing to capture the Lucene functionality inside the existing knn_vector field type in the plugin. Then, we would use the existing knn query type as well. The interface would be consistent with our current knn_vector.
As a consequence of this change, because only one field mapping parser can be registered to a field name (as well as query parser to query name), we would need to move the implementation to the k-NN plugin. We would still target the 2.2 release.
Using existing knn_vector field type instead of new dense_vector
Original Proposal Limitations
User Experience
The main limitation to this approach is the confusing user experience that it creates. We would have 2 different data types that do very similar things: knn_vector and dense_vector. Each would support indexing dense vectors and running approximate Nearest Neighbor searches. In addition to this, we would have 2 different query types that would do a similar thing: knn_query and knn.
The decision to make a new data type and query type was made originally to separate from architecture specific nature of the k-NN plugin. However, on closer examination, a user could use the k-NN plugin without the native libraries on a different platform as long as we update plugin build infrastructure to decouple platform dependent components from platform independent components.
Updated Proposal
To solve the problem of user confusion, we would build Lucene ANN support into our current knn_vector interface:
{
"type": "knn_vector",
"dimension": 100,
"method": {
"name":"hnsw",
"engine":"lucene",
"space_type": "l2",
"parameters":{
"m":2048,
"ef_construction": 245
}
}
}
The interface would stay the same except for adding lucene as a new engine to support. We have evaluated the technical feasibility of this approach and deemed it feasible.
With this, the query would stay the same as our current query for the plugin:
"query": {
"knn": {
"target_field": {
"vector": [<insert vector>],
"k": 2
}
}
}
Note that in the future, we could update these interfaces to how we see fit. For more information on current interface, please take a look at the k-NN documentation.
Updated Proposal Limitations
Initial Coupling to Architecture Dependent Nature of Plugin
One limitation with the updated proposal is that it couples the Lucene ANN platform agnostic functionality with the plugin’s platform dependence.
However, a user that would want to run the plugin on a non-plugin supported platform, could install the plugin zip without the libraries and use the Lucene engine. As it stands right now, if they tried to use nmslib or faiss, the process would crash.
Overall, we would solve this problem by decoupling the JNI library from the plugin and providing a graceful failure if a user tries to use the JNI components on an unsupported platform. Additionally, we would decouple the build process as well.
Difficulty integrating with other engine plugins
Another limitation lies in a more technical nuance. From the user perspective, the consequence is that they could not create a Lucene k-NN index with another Engine plugin feature like CCR.
This is because, we have to pass the HNSW parameters to the Lucene KnnVectorsFormat during construction. From a plugin, we have no way to add this to the default Codec that OpenSearch will use. Therefore, we will need to use our own custom codec. But, two custom codecs cannot be used at the same time for an index.
In the short term, this will mean users cannot use the Lucene KNN with another feature like CCR. This limitation is currently present for the plugin as well.
Related github issues:
- https://github.com/opensearch-project/OpenSearch/issues/1254
- https://github.com/opensearch-project/OpenSearch/issues/2447
- https://github.com/opensearch-project/k-NN/issues/147
Feedback Requested
Any general comments about the overall direction are welcome. Some specific questions:
- Do you prefer exposing Lucene implementation via existing knn_vector field or by new dense_vector field?
- Do you have use cases where close to 1.0 recall is not critical for k-NN search? For such use cases are you willing to make a tradeoff of lower recall comparing to a less memory consumption and lower query latencies?
- Is L2 Euclidean distance a good default metric for your use cases? Other metrics supported by Lucene are
Dot Product
andCosine Similarity
.
Issue Analytics
- State:
- Created a year ago
- Reactions:4
- Comments:18 (14 by maintainers)
Top GitHub Comments
I’m happy to see this issue. I think it’s one of long-term strategic importance for OpenSearch, given the current momentum of neural search in industry and academia.
I think it’s perfectly acceptable to introduce the functionality as part of the k-NN plugin.
We are still planning to add support for filtering on top of Lucene knn engine, it’s planned currently for 2.4. The way it’s implemented in Lucene makes it rather a hybrid approach of both pre and post filtering, but in any case top k elements are guaranteed. This is due to details of pre-filtering implementation, Lucene keeps traversing HNSW graphs and adding vectors to result until it finds top k vectors.