Incomplete fulltext search results
JabRef version
5.5 (latest release)
Operating system
Windows
Details on version and operating system
Windows 10 21H2
Checked with the latest development build
- I made a backup of my libraries before testing the latest development version.
- I have tested the latest development version and the problem persists
Steps to reproduce the behaviour
When using the fulltext search with a simple single-keyword query, e.g. test, I only get partial results: a subset of the expected entries containing the text test is not displayed in the search results. When I open JabRef’s Lucene index in Luke and execute the same query (content:test), it returns all related entries, including those that are missing from JabRef’s search results.
The library in which I experience this has 400 entries. When I create a new library and add only one of the missing entries, the fulltext search returns it as expected. When I delete large portions (e.g. 350 entries) from my 400-entry library, that missing search result also starts to appear - this does not seem related to deleting a specific (potentially problematic) entry, as it starts to appear after different random selections of entries are removed. There’s also no specific threshold library size that triggers this behavior - I was able to make the result appear after cutting the library randomly down to ~40 - 70 entries.
Appendix
No response
Issue Analytics
- Created a year ago
- Comments: 13 (8 by maintainers)
Top GitHub Comments
Yes it is and we had quite some discussion when implementing it. The problem here is twofold:
I think both of these issues can be solved by switching to Lucene for all searches. Metadata results can be weighted using Lucene, as they would use the same queries, and we can use the overall Lucene score to sort the entry table. (My wish would then be to also change the display of the fulltext search results and show them directly in the table instead of in the tab in the entry editor.)
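A rough sketch of the ranking idea in the comment above, in plain Java rather than Lucene: each entry gets a weighted metadata score plus a fulltext score, and the table is ordered by the sum. The class name, the entry keys, the scores and the weight of 2.0 are all made up for illustration; none of this is JabRef's or Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinedScoreSketch {

    /**
     * Sums the weighted metadata score and the fulltext score per entry key
     * and returns the keys sorted by descending combined score.
     * Ties are broken alphabetically so the order is deterministic.
     */
    public static List<String> rank(Map<String, Double> metadataScores,
                                    Map<String, Double> fulltextScores,
                                    double metadataWeight) {
        Map<String, Double> combined = new HashMap<>();
        metadataScores.forEach((key, score) -> combined.merge(key, metadataWeight * score, Double::sum));
        fulltextScores.forEach((key, score) -> combined.merge(key, score, Double::sum));

        List<String> ranked = new ArrayList<>(combined.keySet());
        ranked.sort(Comparator.comparingDouble((String key) -> combined.get(key))
                              .reversed()
                              .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical scores: Smith2020 matches in both metadata and fulltext.
        Map<String, Double> metadata = Map.of("Smith2020", 1.0);
        Map<String, Double> fulltext = Map.of("Smith2020", 0.4, "Doe2019", 0.9);
        System.out.println(rank(metadata, fulltext, 2.0)); // prints [Smith2020, Doe2019]
    }
}
```

With a single combined score per entry, fulltext hits could appear directly in the entry table, ordered alongside metadata hits, which is what the comment wishes for.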
Thank you, that helped me figure out the problem. A search string of e.g. test results in the parsed Lucene query path:test content:test pageNumber:test modified:test annotations:test at https://github.com/JabRef/jabref/blob/7d4916ead08e340c65dd956286ae22c44ea8cc48/src/main/java/org/jabref/logic/pdf/search/retrieval/PdfSearcher.java#L69-L70

The problem here is maxHits, which is hardcoded to 5 in the search rules, e.g. at https://github.com/JabRef/jabref/blob/7d4916ead08e340c65dd956286ae22c44ea8cc48/src/main/java/org/jabref/model/search/rules/ContainBasedSearchRule.java#L97

I haven’t worked with Lucene in a long while, but it seems to me that the limit applies to each field separately, so the parsed query from above can yield 25 entries at most. Usual text queries don’t match the pageNumber or modified fields, yielding 15 results max, which I can also confirm from my testing.

Now the question is, is this limitation on purpose?
This certainly prevents me from using JabRef for my use case: finding all relevant entries among all entries (or a subgroup of them) that contain, e.g., a specific keyword. Or more generically: doing fulltext-based literature research within a library. Currently, the search can only tell me whether any relevant entry exists at all.
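To make the arithmetic from the comment above concrete, here is a toy model in plain Java (no Lucene involved). It assumes, as the commenter suspects but does not confirm, that the cap is applied to each field separately and the per-field hits are then merged. The class name, the helper methods and the 100 matches per field are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MaxHitsSketch {

    /** Keeps at most maxHits matches per field, then merges them into one result set. */
    public static Set<String> search(Map<String, List<String>> matchesPerField, int maxHits) {
        Set<String> results = new LinkedHashSet<>();
        for (List<String> matches : matchesPerField.values()) {
            results.addAll(matches.subList(0, Math.min(maxHits, matches.size())));
        }
        return results;
    }

    /** Builds a toy index in which 'count' distinct entries match in each given field. */
    public static Map<String, List<String>> toyIndex(int count, String... fields) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String field : fields) {
            List<String> entries = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                entries.add(field + "-entry-" + i);
            }
            index.put(field, entries);
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumption: a plain text query only matches these three of the five fields.
        Map<String, List<String>> index = toyIndex(100, "path", "content", "annotations");
        System.out.println(search(index, 5).size());                 // prints 15
        System.out.println(search(index, Integer.MAX_VALUE).size()); // prints 300
    }
}
```

In this model, 3 matching fields times a cap of 5 gives the 15 results reported above, and all 5 fields would give 25. Both numbers are upper bounds, since in a real library the same entry can match in several fields and would then be counted once.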