Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order

Compare the Python replications (https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md) against the Anserini replications, e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We see tiny differences. For example, for the MS MARCO doc baselines, Pyserini gives:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to Anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked this down in issue #257.
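
One plausible mechanism for a last-digit difference like this, assuming the metric is accumulated as a running floating-point sum over queries, is that floating-point addition is not associative, so the order of accumulation can change the final bits. The sketch below uses made-up values (not taken from any real run) purely to illustrate the effect:

```python
import math

# Hypothetical per-query scores; illustrative values only, not real MS MARCO data.
reciprocal_ranks = [0.1, 0.2, 0.3]

# Accumulate in the listed order.
forward = 0.0
for rr in reciprocal_ranks:
    forward += rr

# Accumulate the same values in reverse order.
backward = 0.0
for rr in reversed(reciprocal_ranks):
    backward += rr

# The two running sums differ in the last digit.
print(forward, backward, forward == backward)

# math.fsum returns a correctly rounded sum, so it does not depend on order.
print(math.fsum(reciprocal_ranks) == math.fsum(reversed(reciprocal_ranks)))
```

Fixing the iteration order, as proposed below, sidesteps the problem without changing how the metric is summed.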

I’d like to fix it so that we get identical results moving forward. My proposed fix is a bit janky, but it’ll work: let’s just store, in Python code, an array of integers corresponding to the ids of the queries in the original queries file. When we’re iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better: we introduce a new query iterator abstraction and hide this implementation detail inside it. The query iterator would take the current dictionary and an optional array holding the iteration order.
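
A minimal sketch of what such an iterator might look like. The class name `QueryIterator`, its parameters, and the plain id-to-text dictionary are assumptions made for illustration; this is not Pyserini's actual API, and the real topics structure may differ:

```python
from typing import Dict, Iterator, List, Optional, Tuple


class QueryIterator:
    """Hypothetical sketch of the proposed abstraction (not Pyserini's API)."""

    def __init__(self, topics: Dict[int, str], order: Optional[List[int]] = None):
        self.topics = topics
        # With no explicit order, fall back to the dictionary's own order.
        self.order = list(order) if order is not None else list(topics.keys())

    def __iter__(self) -> Iterator[Tuple[int, str]]:
        # Yield (id, text) pairs following the stored iteration order.
        for query_id in self.order:
            yield query_id, self.topics[query_id]


# Illustrative data: the dictionary order differs from the queries-file order.
topics = {2: "second query", 1: "first query", 3: "third query"}
file_order = [1, 2, 3]  # ids as they would appear in the original queries file

ordered = list(QueryIterator(topics, file_order))
default = list(QueryIterator(topics))
print(ordered)
print(default)
```

Keeping the order array optional means callers that do not care about exact replication can keep passing only the dictionary.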

Thoughts @MXueguang? I was thinking you could work on this?

Top GitHub Comments

MXueguang commented, Jan 9, 2021

I see. So the query iterator takes cur_dict and order_array and yields (id, text) pairs.

lintool commented, Jan 14, 2021

Closed by #309