Support BM25 parameters customization

Would you consider supporting customization of the BM25 parameters? It would be very helpful for optimizing search relevance.

var k = 1.2; // Term frequency saturation point. Recommended values are between 1.2 and 2.
var b = 0.7; // Length normalization impact. Recommended values are around 0.75.
var d = 0.5; // BM25+ frequency normalization lower bound. Recommended values are between 0.5 and 1.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
rolftimmermans commented, Jun 21, 2022

Here are some of my notes that may help in documenting the parameters, if they’re exposed. May need a bit of a rewrite 😃

This article is also helpful for understanding k and b: https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

k is the BM25 term frequency saturation point.

  • A higher term frequency means that a document is more relevant, but BM25 makes sure that the increase in relevance quickly diminishes. Mental model: a document with 3 occurrences of a term is a better match than a document with 1 occurrence, but it’s not 3x better.
  • Higher values increase the relevance difference between documents with higher/lower term frequencies.
  • Lower values reduce the relevance difference between documents with higher/lower term frequencies.
  • Default is 1.2.
  • Recommended values are between 1.2 and 2.
  • Setting this to 0 or a negative number is invalid (could be validated automatically?).
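
The saturation effect can be sketched with the term-frequency part of the textbook BM25 formula, with length normalization left out. This is only an illustration: the function name `saturation` is made up for this example and is not taken from the library's source.

```javascript
// Term-frequency part of BM25, ignoring field length (hypothetical helper).
// It grows with termFreq but never exceeds k + 1.
function saturation(termFreq, k) {
  return (termFreq * (k + 1)) / (termFreq + k);
}

const k = 1.2;
const once = saturation(1, k);
const thrice = saturation(3, k);

// Three occurrences beat one occurrence, but by far less than 3x.
console.log(thrice / once); // roughly 1.57

// A larger k widens the gap between low and high term frequencies.
console.log(saturation(3, 2) / saturation(1, 2)); // 1.8
```

With a very small k the ratio approaches 1, meaning extra occurrences barely matter.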

b is the BM25 length normalization impact.

  • A document with a longer field length needs to have a slightly higher term frequency to achieve the same relevance as a document with a shorter field length.
  • Higher values increase the weight that field length has on scoring.
  • Lower values decrease the weight that field length has on scoring.
  • Setting this to 0 stops field length from affecting scoring altogether (not recommended).
  • Default value is 0.7.
  • Recommended values are around 0.75.
  • Setting this to negative values is invalid (could be validated automatically?).
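
As a rough illustration of the points above, the sketch below adds the textbook BM25 length normalization to the term-frequency component. The function name `tfComponent` and the sample field lengths are assumptions made for this example, not the library's code.

```javascript
// BM25 term-frequency component with length normalization (hypothetical helper).
function tfComponent(termFreq, fieldLength, avgFieldLength, k, b) {
  // lengthNorm is 1 for an average-length field, and always 1 when b is 0.
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return (termFreq * (k + 1)) / (termFreq + k * lengthNorm);
}

const k = 1.2;
const avgFieldLength = 100;

// Same term frequency, short field vs. long field, with b = 0.7.
const shortField = tfComponent(2, 50, avgFieldLength, k, 0.7);
const longField = tfComponent(2, 400, avgFieldLength, k, 0.7);
console.log(shortField > longField); // true: the longer field scores lower

// With b = 0 the field length no longer matters.
console.log(tfComponent(2, 50, avgFieldLength, k, 0) === tfComponent(2, 400, avgFieldLength, k, 0)); // true
```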

d (actually δ) is the BM25+ frequency normalization lower bound.

  • Addresses a deficiency in BM25: a very long field that does match the query term can score almost as low as a field that does not contain the term at all.
  • Increasing this parameter increases the minimum relevance of one occurrence of a search term regardless of its (very long) field length.
  • Decreasing this parameter penalises long fields that contain only one or a few occurrences of the term.
  • Default value is 0.5.
  • Recommended values are between 0.5 and 1.0.
  • Setting this to 0 disables this feature (not recommended).
  • Setting this to a negative number is invalid (could be validated automatically?).
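
Putting the three parameters together, here is a minimal sketch of the textbook BM25+ score for one term in one field, along with the kind of automatic validation the notes above ask about. The names `validateBM25Params` and `bm25PlusTermScore` are hypothetical, the IDF form is the common Lucene-style one, and none of this is claimed to match the library's actual implementation.

```javascript
// Hypothetical validation: rejects the values the notes above call invalid.
// The negated comparisons also reject NaN and undefined.
function validateBM25Params({ k, b, d }) {
  if (!(k > 0)) throw new RangeError('k must be greater than 0');
  if (!(b >= 0)) throw new RangeError('b must not be negative');
  if (!(d >= 0)) throw new RangeError('d must not be negative');
}

// Textbook BM25+ score of one term in one field (illustrative sketch).
function bm25PlusTermScore(termFreq, fieldLength, avgFieldLength, matchingDocs, totalDocs, params) {
  validateBM25Params(params);
  const { k, b, d } = params;
  const idf = Math.log(1 + (totalDocs - matchingDocs + 0.5) / (matchingDocs + 0.5));
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  const tfPart = (termFreq * (k + 1)) / (termFreq + k * lengthNorm);
  // d is the lower bound added to the term-frequency part of every match.
  return idf * (d + tfPart);
}

const params = { k: 1.2, b: 0.7, d: 0.5 };
const withoutFloor = { ...params, d: 0 };

// One occurrence in a field 1000x longer than average; 10 of 1000 documents match.
console.log(bm25PlusTermScore(1, 100000, 100, 10, 1000, withoutFloor)); // close to 0
console.log(bm25PlusTermScore(1, 100000, 100, 10, 1000, params));       // stays above idf * d
```

With d set to 0 the match in the very long field is nearly indistinguishable from no match, which is the deficiency d is meant to address.
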
1 reaction
rayhsieh commented, Jun 22, 2022

@lucaong Thank you for planning this request. While working on the search function for my dataset, I found it very flexible to be able to add a language-specific tokenizer. I don’t have other recommendations at this point, since it already meets what I need.
