Efficiently compute BM25 scores between a collection of queries and documents
Hi,
I have X queries and a collection of Y documents, and I'd like to compute the BM25 score for each pair of them (resulting in an X by Y matrix). I need to do this many times, each time with different queries and documents.
Is there any way to do this efficiently? The only function for computing a BM25 score that I could find (https://github.com/castorini/pyserini/blob/48126de83ee54f8f77592d3d5b5386ad0a2ddf71/pyserini/index/_base.py#L440) calculates the score for a single pair and cannot be parallelized. (It caused a deadlock when I used multiprocessing on top of it, probably related to #15.)
Is there any way I can speed up the computation by somehow processing this in batch?
For instance, can I search within a specific subset of the documents by supplying the docids? If so, I can solve this by doing a batch search within a restricted index containing only those Y documents.
Issue Analytics
- Created 2 years ago
- Comments:12 (7 by maintainers)
Top GitHub Comments
@ccsasuke Implementing a docid list in Anserini would be the best option for improving efficiency. With the current Pyserini, though, see whether this vectorizer API can save you some time computing the scores: https://github.com/castorini/pyserini/blob/dd7d18c4e99a57dc1923cd3c7d5ff0eee45bb9d8/pyserini/vectorizer/_base.py#L139

The `BM25Vectorizer` contains two methods:
- `get_vectors(self, docid_list, norm=None)` calculates the BM25 weights for documents.
- `get_query_vector(query)` gives the term frequency vector for a query.

You can use `get_vectors` to get a sparse matrix M1 for your Y documents, with shape (|Y|, d), and call `get_query_vector` on each query, stacking the results into a sparse matrix M2 for your X queries, with shape (|X|, d). Multiplying M2 by the transpose of M1 gives the BM25 score matrix with shape (|X|, |Y|). This may save you the time of calculating the scores one by one.

Emm, yeah, I was hoping `get_document_vector` would be as efficient as `compute_query_document_score`, but it seems not 😦, and it is even worse. Maybe we have to add the feature on the Anserini end then…
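
For reference, the matrix-product idea from the first comment can be sketched without Pyserini at all. The snippet below is only an illustration under stated assumptions: it uses pre-tokenized toy documents, dense NumPy arrays instead of sparse matrices, a common textbook BM25 variant that may not match Anserini's scoring exactly, and hypothetical helper names (`bm25_doc_matrix`, `query_tf_matrix`) that are not part of any library.

```python
import math
from collections import Counter

import numpy as np


def bm25_doc_matrix(docs, vocab, k1=0.9, b=0.4):
    """Hypothetical helper: (num_docs, vocab_size) array of BM25 term weights."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    weights = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            weights[i, vocab[term]] = idf * tf * (k1 + 1) / (tf + length_norm)
    return weights


def query_tf_matrix(queries, vocab):
    """Hypothetical helper: (num_queries, vocab_size) array of query term frequencies."""
    matrix = np.zeros((len(queries), len(vocab)))
    for i, query in enumerate(queries):
        for term, tf in Counter(query).items():
            if term in vocab:  # terms absent from the documents cannot contribute
                matrix[i, vocab[term]] = tf
    return matrix


# Toy, already-tokenized data (assumed for illustration only).
docs = [
    ["bm25", "ranking", "function"],
    ["sparse", "matrix", "product"],
    ["bm25", "sparse", "retrieval", "bm25"],
]
queries = [["bm25"], ["sparse", "matrix"]]

vocab = {term: idx for idx, term in enumerate(sorted({t for d in docs for t in d}))}

doc_matrix = bm25_doc_matrix(docs, vocab)       # shape (|Y|, d)
query_matrix = query_tf_matrix(queries, vocab)  # shape (|X|, d)

# One matrix product yields every query-document score at once: shape (|X|, |Y|).
scores = query_matrix @ doc_matrix.T
print(scores)
```

The document weights are computed once per batch and every pair is then scored by a single matrix product, which is where any saving over per-pair calls would come from. With larger vocabularies, `scipy.sparse` matrices would likely be the better fit than dense arrays.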