Improve vector quantization API
Description
Follow-up of https://github.com/apache/lucene/pull/11860#discussion_r1027106953: the API for quantization of vectors is a bit surprising at times in that you can index bytes but then still get float[] arrays whose values are bytes at search time. Can we make sure it’s either bytes all the way, or floats all the way?
I’m not entirely sure how best to move it forward. One option would be to have a disjoint set of APIs for the binary encoding, like for doc values: different Field class, different VectorValues class, different searchNearestNeighbors method. I wonder if another option would be to make it floats all the way all the time, including for the byte encoding, i.e. the Field and VectorValues classes would still take and expose floats, but VectorValues#binaryValue would be removed and the Field constructor would barf if the float values do not represent exact bytes.
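To make the second option concrete, here is a minimal sketch of the kind of check a Field constructor could run to reject floats that are not exact bytes. The class ExactByteVectors and both of its methods are hypothetical names used only for illustration; they are not part of Lucene.

```java
// Hypothetical helper, not a Lucene class: illustrates the "floats all the way" option.
final class ExactByteVectors {
  private ExactByteVectors() {}

  /** Returns true if v is a whole number within the signed byte range [-128, 127]. */
  static boolean isExactByte(float v) {
    // NaN fails the range comparison; fractional values fail the round-trip comparison.
    return v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE && v == (float) (byte) v;
  }

  /** Converts a float vector to bytes, throwing on any value that would lose information. */
  static byte[] toBytesOrThrow(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      if (!isExactByte(vector[i])) {
        throw new IllegalArgumentException(
            "value at index " + i + " is not an exact byte: " + vector[i]);
      }
      out[i] = (byte) vector[i];
    }
    return out;
  }
}
```

With a check like this, a vector such as {-128f, 0f, 5f} would be accepted, while {0.5f} or {128f} would be rejected at indexing time.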
Issue Analytics
- Created 10 months ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
I would tend to expect some organization like this:
indexing:
searching:
codec api:
OK, step one of this is done with the KnnByteVectorQuery. I am next approaching a new KnnByteVectorField and this will cause some refactoring to LeafReader or VectorValues.
Right now, VectorValues assumes all vector values are float[] and “expands” the bytes, needlessly.
There are many ways to approach this and I am honestly not sure the way to go here (being unfamiliar with typical API patterns).
It seems there are a handful of seemingly valid options:
- VectorValues (which is returned via LeafReader) that explicitly returns byteVectorValue() and will throw if vectorValue() is called when the encoding is BYTE (and throw on byteVectorValue() if encoding is FLOAT32).
- AbstractVectorValues<T> where vectorValue() returns T. LeafReader would be changed to return that instead of VectorValues.
- LeafReader#getByteVectorValues() that returns a ByteVectorValues object. The main downside here is that we are not making it flexible for adding new vector kinds in the future. We will want to add boolean vectors & hamming distance in the future. These are important for image search.
- VectorValues always return BytesRef for everything and provide similarity that knows the encoding and can decode the float[] if that is the encoding. This feels closer to what we do with doc values (taking DoubleDocValuesField & DoubleValuesSource as prior art with how that interaction is done with NumericDocValues).
@jpountz @rmuir do y’all have opinions here?
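For comparison, the first two options above can be sketched side by side. This is only an illustration under simplifying assumptions: every type below (Encoding, DualVectorValues, AbstractVectorValuesSketch, ByteValuesSketch) is a hypothetical stand-in, each holds a single vector rather than iterating over documents, and none of it mirrors Lucene's actual classes.

```java
// Sketch only: hypothetical types, not Lucene's API.
enum Encoding { BYTE, FLOAT32 }

/** Option 1: one class with two accessors; the accessor that does not match the encoding throws. */
final class DualVectorValues {
  private final Encoding encoding;
  private final byte[] bytes;   // non-null only when encoding == BYTE
  private final float[] floats; // non-null only when encoding == FLOAT32

  private DualVectorValues(Encoding encoding, byte[] bytes, float[] floats) {
    this.encoding = encoding;
    this.bytes = bytes;
    this.floats = floats;
  }

  static DualVectorValues ofBytes(byte[] v) {
    return new DualVectorValues(Encoding.BYTE, v, null);
  }

  static DualVectorValues ofFloats(float[] v) {
    return new DualVectorValues(Encoding.FLOAT32, null, v);
  }

  float[] vectorValue() {
    if (encoding != Encoding.FLOAT32) {
      throw new UnsupportedOperationException("encoding is " + encoding);
    }
    return floats;
  }

  byte[] byteVectorValue() {
    if (encoding != Encoding.BYTE) {
      throw new UnsupportedOperationException("encoding is " + encoding);
    }
    return bytes;
  }
}

/** Option 2: the element type is a type parameter, so no accessor ever needs to throw. */
abstract class AbstractVectorValuesSketch<T> {
  abstract T vectorValue();
}

final class ByteValuesSketch extends AbstractVectorValuesSketch<byte[]> {
  private final byte[] value;

  ByteValuesSketch(byte[] value) {
    this.value = value;
  }

  @Override
  byte[] vectorValue() {
    return value;
  }
}
```

In this sketch, option 1 moves the encoding mismatch to a runtime exception, while option 2 surfaces it at compile time through the type parameter, at the cost of changing what LeafReader returns.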