Tiny difference in MRR@10 for MS MARCO passage using pyserini.search
hey @qguo96 - here’s what I’m getting:
$ python -m pyserini.search --topics msmarco_passage_dev_subset --index msmarco-passage --output runs/run.msmarco-passage.2.txt --msmarco --bm25
$ python tools/scripts/msmarco/msmarco_eval.py collections/msmarco-passage/qrels.dev.small.tsv runs/run.msmarco-passage.2.txt
#####################
MRR @10: 0.18741227770955543
QueriesRanked: 6980
#####################
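For anyone following along, the eval script is essentially averaging, over the 6,980 dev queries, the reciprocal rank of the first relevant passage within the top 10. A rough sketch of that computation, assuming the same qrels and run-file paths as in the commands above and leaving out the real script’s validation and bookkeeping:

from collections import defaultdict

# Simplified sketch of what tools/scripts/msmarco/msmarco_eval.py computes.
qrels = defaultdict(set)
with open('collections/msmarco-passage/qrels.dev.small.tsv') as f:
    for line in f:
        qid, _, pid, rel = line.split()
        if int(rel) > 0:
            qrels[qid].add(pid)

run = defaultdict(list)
with open('runs/run.msmarco-passage.2.txt') as f:
    for line in f:
        qid, pid, rank = line.split()        # MS MARCO run format: qid <tab> pid <tab> rank
        run[qid].append((int(rank), pid))

total = 0.0
for qid, entries in run.items():
    for rank, pid in sorted(entries)[:10]:   # only the top 10 contribute to MRR@10
        if pid in qrels[qid]:
            total += 1.0 / rank
            break

print(f'MRR @10: {total / len(qrels)}')
print(f'QueriesRanked: {len(run)}')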
However, the MRR@10 reported here (https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md) is:
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
Note the tiny difference in the final digit. Would you mind looking into this? Could you diff the actual output from both cases and see what’s going on? Just wanted to make sure this isn’t a bug…
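One thing worth ruling out first: floating-point addition isn’t associative, so averaging the same per-query reciprocal ranks in a different query order can already change the trailing digits of the mean. A quick illustration, with made-up reciprocal-rank values (not taken from either run):

import random

# Floating-point addition is not associative, so averaging the *same* per-query
# reciprocal ranks in a different order can shift the last digit or two of the mean.
# The values below are synthetic, purely for illustration.
random.seed(0)
rr = [1.0 / random.randint(1, 10) for _ in range(6980)]

mrr_a = sum(rr) / len(rr)

shuffled = list(rr)
random.shuffle(shuffled)
mrr_b = sum(shuffled) / len(shuffled)

print(mrr_a)
print(mrr_b)
print(abs(mrr_a - mrr_b))   # typically a few ulps, i.e. on the order of 1e-17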
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Agreed, but for replicability purposes we’d ideally want results to be exactly the same, and if they’re not, we should at least understand why.
(Sorry to interject!) Since the difference is 3e-17, isn’t the problem more about the unrounded precision of the eval script than anything else? Differences due to summation order seem expected (if only, in part, with hindsight!), and they’re in fact smaller than I’d personally have guessed.
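On the “diff the actual output” suggestion above, a rough sketch for checking whether the two runs actually rank anything differently per query; the second file name is hypothetical and should point at the run produced by following the Anserini doc:

from collections import defaultdict

def load_top10(path):
    # Load an MS MARCO-format run (qid <tab> pid <tab> rank) as qid -> ordered top-10 pids.
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, pid, rank = line.split()
            run[qid].append((int(rank), pid))
    return {qid: [p for _, p in sorted(entries)[:10]] for qid, entries in run.items()}

run_a = load_top10('runs/run.msmarco-passage.2.txt')
# Hypothetical path -- point this at the run generated from the Anserini doc's commands.
run_b = load_top10('runs/run.msmarco-passage.anserini.txt')

diffs = [qid for qid in run_a if run_a[qid] != run_b.get(qid)]
print(f'{len(diffs)} queries with differing top-10 lists')

If that prints 0, the rankings are identical and the 3e-17 gap can only come from summation order and the unrounded output, not from the retrieval itself.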