Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order


Compare the Pyserini replications (https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md) against the Anserini replications, e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We'll note tiny differences. For example, for the MS MARCO doc baselines, Pyserini gives:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to Anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked this down to query iteration order in issue #257.
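Why would iteration order change only the last digits? A likely explanation, assuming the metric is accumulated by adding per-query scores one at a time, is that floating-point addition is not associative. The sketch below is illustrative only: `total_in_order` and the toy scores are hypothetical and are not taken from either toolkit's evaluation code.

```python
def total_in_order(scores, order):
    """Add per-query scores one by one, following the given query order."""
    total = 0.0
    for qid in order:
        total += scores[qid]
    return total


# Toy per-query scores keyed by query id (made-up values, not MS MARCO data).
scores = {1: 0.1, 2: 0.2, 3: 0.3}

forward = total_in_order(scores, [1, 2, 3])
backward = total_in_order(scores, [3, 2, 1])

# Same scores, different order: the two totals differ in the last digit.
print(forward, backward, forward == backward)
```

With thousands of queries, the same effect can plausibly surface as a difference in the final digits of the averaged metric.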

I'd like to fix it so we get identical results moving forward. My proposed fix is a bit janky, but it'll work: let's just store, in Python code, an array of integers corresponding to the ids of the queries, in the order they appear in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.
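A minimal sketch of what such an iterator could look like. The name `iter_queries` and its parameters are hypothetical; this is not the code that was eventually merged into Pyserini, and it assumes the queries are held in a dictionary mapping query id to query text.

```python
from typing import Dict, Iterator, Optional, Sequence, Tuple


def iter_queries(
    queries: Dict[int, str],
    order: Optional[Sequence[int]] = None,
) -> Iterator[Tuple[int, str]]:
    """Yield (id, text) pairs, following `order` when one is supplied.

    Without `order`, fall back to the dictionary's own iteration order.
    An id in `order` that is missing from `queries` raises KeyError.
    """
    ids = order if order is not None else queries.keys()
    for qid in ids:
        yield qid, queries[qid]


# Toy example: the dictionary order differs from the original file order.
queries = {3: "third query", 1: "first query", 2: "second query"}
file_order = [1, 2, 3]

for qid, text in iter_queries(queries, file_order):
    print(qid, text)
```

Callers that pass no order array keep the existing behavior, so the stored order only needs to be supplied for datasets where exact agreement with Anserini matters.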

Thoughts @MXueguang? I was thinking you could work on this?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

MXueguang commented, Jan 9, 2021 (1 reaction)

I see. So the query iterator takes cur_dict and order_array and yields (id, text) pairs.

lintool commented, Jan 14, 2021 (0 reactions)

Closed by #309


