Efficiently compute BM25 scores between a collection of queries and documents

Hi,

I have X queries and a collection of Y documents, and I’d like to compute the BM25 score for each query-document pair (resulting in an X by Y matrix). I need to do this many times, each time with different queries and documents.

Is there any way to do this efficiently? The only function for computing a BM25 score that I could find (https://github.com/castorini/pyserini/blob/48126de83ee54f8f77592d3d5b5386ad0a2ddf71/pyserini/index/_base.py#L440) calculates the score for a single pair and cannot be parallelized. (It caused a deadlock when I used multiprocessing on top of it, probably related to #15.)

Is there any way to speed up the computation by processing the pairs in a batch? For instance, can I search within a specific subset of the documents by supplying their docids? If so, I could solve this by running a batch search against a restricted index containing only those Y documents.

Issue Analytics

State:
Created 2 years ago
Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions

MXueguang commented, Mar 23, 2021

@ccsasuke Implementing docid-list support in Anserini would be the best way to improve efficiency. With the current Pyserini, though, see whether this vectorizer API saves you some time computing the scores: https://github.com/castorini/pyserini/blob/dd7d18c4e99a57dc1923cd3c7d5ff0eee45bb9d8/pyserini/vectorizer/_base.py#L139 The BM25Vectorizer provides two methods:

get_vectors(self, docid_list, norm=None) computes the BM25 weights for the given documents.
get_query_vector(query) returns the term-frequency vector for a query.

You can use get_vectors to get a sparse matrix M1 for your Y documents, with shape (|Y|, d), and get_query_vector to build a sparse matrix M2 for your X queries (one row per query), with shape (|X|, d). Here d is the number of terms in the vocabulary. Multiplying M2 by the transpose of M1 gives the BM25 score matrix with shape (|X|, |Y|), which may be faster than computing the scores one pair at a time.
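As a rough illustration of the matrix product described in this comment, here is a minimal sketch using SciPy sparse matrices. It does not call Pyserini: doc_weights and query_tf are hand-written toy matrices standing in for what get_vectors and repeated get_query_vector calls would be expected to return, and score_matrix is a hypothetical helper name. It follows the question's convention of X queries and Y documents, and it assumes the document rows hold unnormalized per-term BM25 weights.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins (assumptions, not Pyserini output); vocabulary size d = 4.
# doc_weights: one row per document, holding per-term BM25 weights, shape (|Y|, d).
doc_weights = csr_matrix(np.array([
    [1.2, 0.0, 0.5, 0.0],
    [0.0, 0.8, 0.0, 0.3],
    [0.4, 0.4, 0.0, 0.0],
]))

# query_tf: one row per query, holding term frequencies, shape (|X|, d).
query_tf = csr_matrix(np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]))


def score_matrix(query_tf, doc_weights):
    """Hypothetical helper: return the dense (|X|, |Y|) score matrix.

    Entry (i, j) sums, over the terms of query i, the term frequency in the
    query times the BM25 weight of that term in document j.
    """
    return (query_tf @ doc_weights.T).toarray()


scores = score_matrix(query_tf, doc_weights)
print(scores.shape)  # one row per query, one column per document
print(scores)
```

The product stays sparse until the final toarray() call, so for large X and Y it may be preferable to keep the result sparse and only extract the entries that are needed.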

0 reactions

MXueguang commented, Mar 23, 2021

“plus it’s not batched at the back end”

Hmm, yeah. I was hoping get_document_vector would be as efficient as compute_query_document_score, but it seems it isn’t 😦, and it’s even worse. Maybe we’ll have to add the feature on the Anserini side, then…
