Large DocValues field retrieval test

I am working on LUCENE-8374 and I would like to have independent verification of my findings. I have looked through luceneutil and it does not seem to have a test for my case. I could try and add a fitting test and make a pull request, but I fear that I might be doctoring it to fit my view on the issue, so I ask for input on this.

What I have found is a performance regression introduced in Lucene 7+ with the sequential API for DocValues. The regression (not surprisingly) affects random access to DocValues fields and gets progressively worse as segment size (measured in number of documents) grows. The simplest real-world case is document retrieval as part of a search, with fields that are not stored but have DocValues enabled.

Compounding factors are:

  • The field is an integer Numeric (int/long/date)
  • The field is DENSE (between 6% and 99% of the documents have a value for the field; the closer to 99%, the worse the regression)
  • The number of documents in a segment (maxDoc is a usable indicator)

The pathological case is a 100M+ doc index merged down to 1 segment (which we have in abundance), but the effect is measurable even with segments of 2M documents with randomized searches and small result sets.
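
To illustrate why the cost would grow with segment size, here is a toy model in Java. It is not Lucene code and does not measure anything real. It assumes doc IDs are grouped into fixed-size blocks (the 65,536 block size is an assumption of the model) and that a fresh forward-only iterator must step over every block before the target, one at a time. The class and method names are hypothetical.

```java
import java.util.Random;

/** Toy cost model for forward-only block skipping. Hypothetical names, not Lucene code. */
class SequentialSkipCost {
    // Assumed number of documents covered by one block in this model.
    static final int BLOCK_SIZE = 65_536;

    /** Blocks touched when a fresh iterator starting at doc 0 walks block by block to target. */
    static long blocksSteppedLinear(int target) {
        return target / BLOCK_SIZE + 1L;
    }

    /** Blocks touched if a per-block offset table allowed a direct jump to the target block. */
    static long blocksSteppedWithJumpTable(int target) {
        return 1L;
    }

    /** Average linear cost over uniformly random targets in [0, maxDoc). */
    static double averageLinearCost(int maxDoc, int lookups, long seed) {
        Random random = new Random(seed);
        long total = 0;
        for (int i = 0; i < lookups; i++) {
            total += blocksSteppedLinear(random.nextInt(maxDoc));
        }
        return (double) total / lookups;
    }

    public static void main(String[] args) {
        // Segment sizes chosen to match the 2M and 100M+ cases mentioned above.
        for (int maxDoc : new int[] {2_000_000, 20_000_000, 100_000_000}) {
            System.out.printf("maxDoc=%,d avg blocks per lookup=%.1f%n",
                    maxDoc, averageLinearCost(maxDoc, 10_000, 42L));
        }
    }
}
```

In this model the average number of blocks touched per lookup scales linearly with maxDoc, while a direct jump stays constant. That matches the shape of the regression described here, but only a real benchmark against Lucene can confirm the magnitude.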

Does this sound like something that is covered by any of the tests in luceneutil? Maybe the taxi-corpus? I looked, but it seems to use floating point numbers, which aren’t as affected.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 19 (1 by maintainers)

Top GitHub Comments

jpountz commented, Sep 20, 2018

I think sorted queries are interesting for that reason: they are a realistic use case, yet secondary processing is lightweight. Most of the time the new document will not be competitive, so the only processing done with the value is a long comparison.

tokee commented, Sep 24, 2018

FYI: I am currently busy with in-house work. I'll get back to this issue later.

Regarding the abuse thing, I have posted “DocValues, retrieval performance and policy” on the developer mailing list to get a better understanding.
