Large DocValues field retrieval test
I am working on LUCENE-8374 and would like independent verification of my findings. I have looked through luceneutil and it does not seem to have a test for my case. I could try to add a fitting test and make a pull request, but I fear that I might be doctoring it to fit my view on the issue, so I am asking for input on this.
What I have found is a performance regression introduced in Lucene 7+ with the sequential API for DocValues. The regression is (not surprisingly) for random access of DocValues fields and gets progressively worse as segment size (measured in number of documents) grows. The simplest real-world case is document retrieval as part of a search, with fields that are not stored but which have DocValues enabled.
Compounding factors are
- The field is an integer Numeric (int/long/date)
- The field is DENSE (between 6% and 99% of the documents have a value for the field; the closer to 99%, the worse the regression)
- The number of documents in a segment (maxDoc is a usable indicator)
The pathological case is a 100M+ doc index merged down to 1 segment (which we have in abundance), but the effect is measurable even with segments of 2M documents with randomized searches and small result sets.
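To make the scaling effect described above concrete, here is a minimal toy model in Java. It is not Lucene's code: the class ForwardOnlyValuesModel and its methods are hypothetical names, and it assumes, as a simplification, that finding the value for a document means counting set bits in a one-bit-per-document bitmap from the iterator's current position, and that a backward jump forces a restart from document 0. Under those assumptions, the average work per random lookup grows with the number of documents.

```java
import java.util.Random;

/**
 * Toy model of a forward-only doc-values iterator. Hypothetical code for
 * illustration only; it is NOT taken from Lucene.
 */
final class ForwardOnlyValuesModel {
    private final long[] bits;   // one bit per document: 1 = document has a value
    private int wordPos = 0;     // index of the first word not yet counted
    private int ordinalBase = 0; // number of set bits in words [0, wordPos)
    long wordsScanned = 0;       // work counter: words passed over while advancing

    ForwardOnlyValuesModel(long[] bits) {
        this.bits = bits;
    }

    /** Returns the ordinal of the value for docId, or -1 if the document has no value. */
    int advanceExact(int docId) {
        int targetWord = docId >>> 6;
        if (targetWord < wordPos) {
            // Assumption: the iterator cannot go backwards, so start over from document 0.
            wordPos = 0;
            ordinalBase = 0;
        }
        while (wordPos < targetWord) {
            ordinalBase += Long.bitCount(bits[wordPos++]);
            wordsScanned++;
        }
        long word = bits[targetWord];
        long mask = 1L << (docId & 63);
        if ((word & mask) == 0) {
            return -1;
        }
        // Ordinal = values in earlier words + values earlier in this word.
        return ordinalBase + Long.bitCount(word & (mask - 1));
    }

    /** Builds a bitmap where each document has a value with the given probability. */
    static long[] randomBits(int maxDoc, double density, long seed) {
        long[] bits = new long[(maxDoc + 63) >>> 6];
        Random rnd = new Random(seed);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (rnd.nextDouble() < density) {
                bits[doc >>> 6] |= 1L << (doc & 63);
            }
        }
        return bits;
    }

    /** Average number of words scanned per lookup for uniformly random document IDs. */
    static double avgWordsScanned(int maxDoc, int lookups, long seed) {
        // 0.9 is an arbitrary density; in this model it affects ordinals, not scan cost.
        ForwardOnlyValuesModel model = new ForwardOnlyValuesModel(randomBits(maxDoc, 0.9, seed));
        Random rnd = new Random(seed + 1);
        for (int i = 0; i < lookups; i++) {
            model.advanceExact(rnd.nextInt(maxDoc));
        }
        return (double) model.wordsScanned / lookups;
    }

    public static void main(String[] args) {
        for (int maxDoc : new int[] {100_000, 1_000_000, 10_000_000}) {
            System.out.printf("maxDoc=%d avg words scanned per lookup=%.0f%n",
                    maxDoc, avgWordsScanned(maxDoc, 1000, 42L));
        }
    }
}
```

Running main prints the average number of 64-bit words scanned per lookup for each segment size. In this model the number grows roughly in proportion to maxDoc for random document IDs, while lookups in increasing document ID order never trigger a restart. A benchmark against a real index would still be needed to confirm how closely actual behavior follows this.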
Does this sound like something that is covered by any of the tests in luceneutil? Maybe the taxi-corpus? I looked, but it seems to use floating point numbers, which aren’t as affected.
Issue Analytics
- Created 5 years ago
- Comments:19 (1 by maintainers)
Top GitHub Comments
I think sorted queries are interesting for that reason: it is a realistic use-case, yet secondary processing is lightweight as most of the time the new document will not be competitive so the only processing that is done with the value is a long comparison.
FYI: I am currently busy with in-house work. I’ll get back to the current issue later.
Regarding the abuse thing, I have posted “DocValues, retrieval performance and policy” on the developer mailing list to get a better understanding.