Bug in bm25 ranking function with more than one term
I think I’ve found a bug in the bm25 implementation:
The specific problem is here:
This code is supposed to extract the term_frequency and docs_with_term for term i and column j. BUT… I don’t think the array pointer arithmetic here is correct. In particular, with more than one term I seem to be getting the wrong results.
After quite a lot of digging around, I think I’ve prepared an example that illustrates the problem. My code is here: https://gist.github.com/simonw/e0b9156d66b41b172a66d0cfe32d9391
I created a modified version of the bm25 function which outputs debugging information, then ran some sample searches through it. The output illustrating the problem is this:
search = dog cat
============
('both of them', 'both dog dog and cat here')
[2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]
term_count=2, col_count=2, total_docs=5
term (i) = 0, column (j) = 0
avg_length=4.0, doc_length=3.0
term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 0, column (j) = 1
avg_length=5.0, doc_length=6.0
term_frequency_in_this_column=2.0, docs_with_term_in_this_column=2.0
term (i) = 1, column (j) = 0
avg_length=4.0, doc_length=3.0
term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 1, column (j) = 1
avg_length=5.0, doc_length=6.0
term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
-0.438011195601579
That’s for a search for dog cat against the following five documents:
CREATE VIRTUAL TABLE docs USING fts4(c0, c1);
INSERT INTO docs (c0, c1) VALUES ("this is about a dog", "more about that dog dog");
INSERT INTO docs (c0, c1) VALUES ("this is about a cat", "stuff on that cat cat");
INSERT INTO docs (c0, c1) VALUES ("something about a ferret", "yeah a ferret ferret");
INSERT INTO docs (c0, c1) VALUES ("both of them", "both dog dog and cat here");
INSERT INTO docs (c0, c1) VALUES ("not mammals", "maybe talk about fish");
The bug is illustrated by the very last section of the above example output, this bit:
term (i) = 1, column (j) = 1
avg_length=5.0, doc_length=6.0
term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
Here the output is showing that the document ('both of them', 'both dog dog and cat here') was found to match the search for dog cat, but that the statistics for the last term and column (so the term cat in the column both dog dog and cat here) have a term_frequency_in_this_column of 0.0.
This is incorrect! The word cat appears once in that column, so this value should be 1.0.
The bug is in the x = X_O + (3 * j * (i + 1)) line, which calculates the offset within the matchinfo array.
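To make the offset problem concrete, here is a minimal sketch that replays the suspect formula against the matchinfo array from the debug output above. It assumes the array follows SQLite's "pcnalx" matchinfo layout (term count, column count, total docs, then per-column average lengths, per-column document lengths, then one triple per term and column), and that term_frequency is read from x and docs_with_term from x + 2. Those assumptions are consistent with the printed values, but the helper names buggy_offset and read_stats are mine, not the library's.

```python
# matchinfo values copied from the debug output in this issue
match_info = [2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]

term_count, col_count, total_docs = match_info[0:3]

# Assumed layout: 3 header values, col_count averages, col_count lengths, then "x"
X_O = 3 + 2 * col_count


def buggy_offset(i, j):
    """Offset formula quoted in the issue as the suspected bug."""
    return X_O + (3 * j * (i + 1))


def read_stats(x):
    """Return (term_frequency, docs_with_term) for the triple starting at x."""
    return match_info[x], match_info[x + 2]


# Split the "x" section into its triples so the slots are easy to see
triples = [
    match_info[X_O + 3 * k : X_O + 3 * k + 3]
    for k in range(term_count * col_count)
]
print(triples)  # [[0, 1, 1], [2, 4, 2], [0, 1, 1], [1, 3, 2]]

for i in range(term_count):
    for j in range(col_count):
        x = buggy_offset(i, j)
        tf, docs = read_stats(x)
        print(f"term={i} column={j} offset={x} tf={tf} docs_with_term={docs}")
```

Under these assumptions the loop reproduces the debug output: term 1, column 1 reads offset 13, which is the third triple, while the fourth triple [1, 3, 2] (the one starting with the expected term frequency of 1) is never read. Term 1, column 0 also lands on offset 7, the same slot as term 0, column 0.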
Top GitHub Comments
You suggested
x = X_O + ((i * term_count) + j) * 3
But I think it is
x = X_O + ((i * col_count) + j) * 3
and it looks like the issue on the linked repo agrees with that.
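The two candidate formulas give identical offsets whenever term_count equals col_count, which is the case in the 2-term, 2-column example from this issue, so that example alone cannot tell them apart. The sketch below compares them on a hypothetical 2-term, 3-column layout. It assumes the triples are stored term by term with columns innermost, and it uses x_o=0 purely for readability; the function names are illustrative.

```python
def offset_with_term_count(i, j, term_count, col_count, x_o=0):
    """First suggestion: multiply the term index by term_count."""
    return x_o + ((i * term_count) + j) * 3


def offset_with_col_count(i, j, term_count, col_count, x_o=0):
    """Second suggestion: multiply the term index by col_count."""
    return x_o + ((i * col_count) + j) * 3


def all_offsets(fn, term_count, col_count):
    """Offsets for every (term, column) pair, terms outermost."""
    return [
        fn(i, j, term_count, col_count)
        for i in range(term_count)
        for j in range(col_count)
    ]


# 2 terms, 2 columns: both formulas agree
print(all_offsets(offset_with_term_count, 2, 2))  # [0, 3, 6, 9]
print(all_offsets(offset_with_col_count, 2, 2))   # [0, 3, 6, 9]

# Hypothetical 2 terms, 3 columns: the formulas diverge
print(all_offsets(offset_with_term_count, 2, 3))  # [0, 3, 6, 6, 9, 12]
print(all_offsets(offset_with_col_count, 2, 3))   # [0, 3, 6, 9, 12, 15]
```

With three columns, the term_count version maps (term 0, column 2) and (term 1, column 0) to the same offset and never reaches the final triple, while the col_count version visits each of the six triples exactly once.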