Bug in bm25 ranking function with more than one term


I think I’ve found a bug in the bm25 implementation:

https://github.com/coleifer/peewee/blob/a24b36da3a101458a854e6a4319f4bb8d8cb478f/playhouse/sqlite_ext.py#L1160-L1175

The specific problem is here:

https://github.com/coleifer/peewee/blob/a24b36da3a101458a854e6a4319f4bb8d8cb478f/playhouse/sqlite_ext.py#L1173-L1175

This code is supposed to extract the term_frequency and docs_with_term for term i and column j. BUT… I don’t think the array pointer arithmetic here is correct. In particular, with more than one term I seem to be getting the wrong results.

After quite a lot of digging around, I think I’ve prepared an example that illustrates the problem. My code is here: https://gist.github.com/simonw/e0b9156d66b41b172a66d0cfe32d9391

I created a modified version of the bm25 function which outputs debugging information, then ran some sample searches through it. The output illustrating the problem is this:

search = dog cat
============
('both of them', 'both dog dog and cat here')
[2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]
term_count=2, col_count=2, total_docs=5
term (i) = 0, column (j) = 0
  avg_length=4.0, doc_length=3.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 0, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=2.0, docs_with_term_in_this_column=2.0
term (i) = 1, column (j) = 0
  avg_length=4.0, doc_length=3.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
term (i) = 1, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0
-0.438011195601579

That’s for a search for dog cat against the following five documents:

CREATE VIRTUAL TABLE docs USING fts4(c0, c1);
INSERT INTO docs (c0, c1) VALUES ('this is about a dog', 'more about that dog dog');
INSERT INTO docs (c0, c1) VALUES ('this is about a cat', 'stuff on that cat cat');
INSERT INTO docs (c0, c1) VALUES ('something about a ferret', 'yeah a ferret ferret');
INSERT INTO docs (c0, c1) VALUES ('both of them', 'both dog dog and cat here');
INSERT INTO docs (c0, c1) VALUES ('not mammals', 'maybe talk about fish');

The bug is illustrated by the very last section of the above example output, this bit:

term (i) = 1, column (j) = 1
  avg_length=5.0, doc_length=6.0
  term_frequency_in_this_column=0.0, docs_with_term_in_this_column=1.0

The output shows that the document ('both of them', 'both dog dog and cat here') matched the search for dog cat, but the statistics for the last term and column (the term cat in the column containing both dog dog and cat here) report a term_frequency_in_this_column of 0.0.

This is incorrect! The word cat appears once in that column, so this value should be 1.0.

The bug is in the x = X_O + (3 * j * (i + 1)) line which calculates the offset within the matchinfo array.
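
To see where that expression lands, here is a minimal sketch that evaluates it against the matchinfo array from the debug output. It assumes the array uses the 'pcnalx' layout (the values shown are consistent with that), and the helper names triple_at and buggy_offset are illustrative, not taken from peewee.

```python
# matchinfo array copied from the debug output in the issue
match_info = [2, 2, 5, 4, 5, 3, 6, 0, 1, 1, 2, 4, 2, 0, 1, 1, 1, 3, 2]

# Assumed 'pcnalx' layout: p, c, n, then one average length per column,
# then one document length per column, then the x triples.
term_count, col_count, total_docs = match_info[0:3]
A_O = 3
L_O = A_O + col_count
X_O = L_O + col_count  # index where the (term, column) triples start


def triple_at(offset):
    """Return (hits_in_this_row, hits_in_all_rows, docs_with_hits)."""
    return tuple(match_info[offset:offset + 3])


def buggy_offset(i, j):
    # The expression quoted in the issue
    return X_O + (3 * j * (i + 1))


for i in range(term_count):
    for j in range(col_count):
        x = buggy_offset(i, j)
        print(f"term {i}, column {j}: offset {x} -> {triple_at(x)}")
```

For term 1, column 1 the expression yields offset 13, which holds the triple for term 1, column 0. The triple at offset 16, (1, 3, 2), is never read, which is why the term frequency for cat comes out as 0.0 instead of 1.0.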

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

coleifer commented, Jan 6, 2019 (1 reaction)

You suggested x = X_O + ((i * term_count) + j) * 3

But I think it is x = X_O + ((i * col_count) + j) * 3 and it looks like the issue on the linked repo agrees with that.
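
The two candidate expressions can be compared with plain arithmetic. This is a rough sketch, not code from peewee: the function names are made up for illustration, and the 2-term, 3-column shape is a hypothetical case chosen because the example in the issue has term_count equal to col_count, which hides the difference.

```python
def offsets(term_count, col_count, index):
    """Offsets relative to X_O produced by an index expression."""
    return [3 * index(i, j)
            for i in range(term_count)
            for j in range(col_count)]


def by_term_count(term_count, col_count):
    # Expression suggested in the issue
    return offsets(term_count, col_count, lambda i, j: i * term_count + j)


def by_col_count(term_count, col_count):
    # Expression proposed in the maintainer's reply
    return offsets(term_count, col_count, lambda i, j: i * col_count + j)


# Shape from the issue: 2 terms, 2 columns. Both expressions agree.
print(by_term_count(2, 2))  # [0, 3, 6, 9]
print(by_col_count(2, 2))   # [0, 3, 6, 9]

# Hypothetical shape: 2 terms, 3 columns. Only col_count gives distinct slots.
print(by_term_count(2, 3))  # [0, 3, 6, 6, 9, 12]
print(by_col_count(2, 3))   # [0, 3, 6, 9, 12, 15]
```

With three columns, multiplying by term_count makes (term 0, column 2) and (term 1, column 0) share an offset, while multiplying by col_count assigns each pair its own triple. That matches the phrase-major, column-minor ordering SQLite documents for the x block of matchinfo.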

coleifer commented, Jan 7, 2019 (0 reactions)

3.8.1
