Efficiently compute BM25 scores between a collection of queries and documents


Hi,

I have X queries and a collection of Y documents, and I’d like to compute the BM25 score for each query-document pair (resulting in an X by Y matrix). I need to do this many times, each time with different queries and documents.

Is there any way to do this efficiently? The only function I could find for computing a BM25 score (https://github.com/castorini/pyserini/blob/48126de83ee54f8f77592d3d5b5386ad0a2ddf71/pyserini/index/_base.py#L440) calculates the score for a single pair and cannot be parallelized. (It caused a deadlock when I used multiprocessing on top of it, probably related to #15.)

Can I speed up the computation by processing this in batches? For instance, can I search within a specific subset of the documents by supplying their docids? If so, I could solve this by running a batch search over a restricted index containing only those Y documents.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions
MXueguang commented, Mar 23, 2021

@ccsasuke Implementing docid-list support in Anserini would be the best option for improving efficiency. With the current Pyserini, though, see whether this vectorizer API can save you some time computing the scores: https://github.com/castorini/pyserini/blob/dd7d18c4e99a57dc1923cd3c7d5ff0eee45bb9d8/pyserini/vectorizer/_base.py#L139 The BM25Vectorizer contains two methods:

  • get_vectors(self, docid_list, norm=None) calculates the BM25 weights for documents.
  • get_query_vector(query) gives the term frequency vector for a query.

You can use get_vectors to get a sparse matrix M1 for your Y documents, with shape (|Y|, d), and use get_query_vector on each of your X queries to build a sparse matrix M2 with shape (|X|, d), where d is the number of terms. Multiplying M2 by the transpose of M1 gives the BM25 score matrix with shape (|X|, |Y|). This may save you the time of calculating the scores one by one.
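To make the matrix idea concrete, here is a minimal pure-Python sketch of the same computation on toy data. It does not call Pyserini: the helper names (bm25_doc_vectors, query_vector, score_matrix), the whitespace tokenization, the Lucene-style IDF formula, and the parameter values k1=0.9 and b=0.4 are all assumptions made for illustration, so the numbers will not necessarily match what BM25Vectorizer produces.

```python
import math
from collections import Counter


def bm25_doc_vectors(docs, k1=0.9, b=0.4):
    """Return one sparse {term: BM25 weight} dict per tokenized document.

    Hypothetical helper: the IDF variant and the k1/b values are assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor shared by all terms of this document.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        vectors.append({
            t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return vectors


def query_vector(query):
    """Return the sparse term-frequency vector of a tokenized query."""
    return dict(Counter(query))


def score_matrix(queries, docs):
    """Return an X-by-Y list of lists: one row per query, one column per document."""
    doc_vecs = bm25_doc_vectors(docs)
    matrix = []
    for q in queries:
        q_vec = query_vector(q)
        # Sparse dot product between the query vector and each document vector.
        matrix.append([
            sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
            for d_vec in doc_vecs
        ])
    return matrix


docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats".split(),
    "sparse matrices make scoring fast".split(),
]
queries = ["cat mat".split(), "sparse scoring".split()]
scores = score_matrix(queries, docs)
for row in scores:
    print([round(s, 3) for s in row])
```

With the document and query vectors stored as SciPy sparse matrices instead of dicts, the nested loop in score_matrix collapses into a single sparse product, which is where the speedup over scoring pair by pair would come from.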
0 reactions
MXueguang commented, Mar 23, 2021

plus it’s not batched at the back end

Hmm, yeah, I was hoping get_document_vector would be as efficient as compute_query_document_score, but it seems not 😦 and it is even worse. Maybe we have to add the feature on the Anserini end then…
