Can I feed 500K documents into rank_bm25?

Thanks for this awesome library.

I am curious to know whether rank_bm25 can handle 500K documents. Each document has around 1000 words.

Looking forward to your feedback. I want to use the following functionality with rank_bm25:

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)


query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=1)

print(result)
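
Whether a corpus of this size is practical depends mostly on memory use and per-query cost, neither of which is measured in this thread. As a rough illustration of what BM25 scoring involves, here is a minimal sketch of Okapi BM25 over an inverted index, so that a query only touches documents containing at least one query term. The class name `TinyBM25`, its methods, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for this example; this is not rank_bm25's implementation, and its scores may differ from the library's.

```python
import math
from collections import Counter, defaultdict


class TinyBM25:
    """Illustrative Okapi BM25 over an inverted index (hypothetical, not rank_bm25)."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # document-length normalization strength (assumed default)
        self.doc_len = [len(doc) for doc in tokenized_corpus]
        self.n_docs = len(tokenized_corpus)
        self.avgdl = sum(self.doc_len) / self.n_docs
        # term -> list of (doc_id, term_frequency) pairs
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(tokenized_corpus):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))

    def idf(self, term):
        df = len(self.postings.get(term, ()))
        # The "1 +" inside the log keeps the IDF non-negative.
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def top_n(self, tokenized_query, n=1):
        scores = defaultdict(float)
        for term in tokenized_query:
            idf = self.idf(term)
            # Only documents that contain the term are visited.
            for doc_id, tf in self.postings.get(term, ()):
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
index = TinyBM25([doc.split(" ") for doc in corpus])
best = index.top_n("windy London".split(" "), n=1)
print([corpus[doc_id] for doc_id, _ in best])  # ['It is quite windy in London']
```

With 500K documents of about 1000 words each, the postings lists hold one entry per distinct term per document, so it is worth timing index construction and measuring memory on a sample of the real corpus before committing to an in-memory approach.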

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
nashid commented, Nov 17, 2022

@AmenRa I am also interested in this feature. I will try out retriv.

0 reactions
AmenRa commented, Nov 17, 2022

Hi @ramsey-coding,

I have just released a new Python-based search engine called retriv. It takes only ~40 ms to query 8M documents on my machine. If you try it, please let me know whether it works for your use case.
