Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order
If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md
Compared against the Anserini replications, e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md
We'll note tiny differences. For example, for the MS MARCO doc baselines, Pyserini gives:
#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################
Compared to Anserini (the two MRR values differ only in the last few digits):
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################
Previously, we tracked this down in issue #257.
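To illustrate why iteration order alone can change the last digits of an averaged metric, here is a minimal sketch. It assumes the difference comes from floating-point accumulation order, which this issue does not spell out. The scores are made up for illustration (they are not real MS MARCO reciprocal ranks), and `running_sum` and `mean_in_order` are hypothetical helpers, not part of Pyserini or Anserini.

```python
def running_sum(values):
    """Add floats left to right with a plain running total."""
    total = 0.0
    for v in values:
        total += v
    return total


def mean_in_order(values):
    """Average the values in exactly the order they are given."""
    return running_sum(values) / len(values)


# Hypothetical per-query scores, chosen only to expose rounding effects.
scores = [0.1, 0.2, 0.3]

forward = mean_in_order(scores)
backward = mean_in_order(list(reversed(scores)))

# Same values, different order: float addition is not associative,
# so the two means need not be bit-for-bit identical.
print(forward == backward)
print(abs(forward - backward))
```

Iterating in the same order on both sides removes this source of variation, which is what the fix proposed here aims for.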
I’d like to fix it so we get identical results moving forward. My proposed fix is a bit janky, but it’ll work: let’s just store, in Python code, an array of integers corresponding to the ids of the queries in the original queries file. When we’re iterating over the dataset in pyserini.search, we just follow the order of the integers.
Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.
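A rough sketch of what such an iterator could look like, under some assumptions: the class name `QueryIterator`, its parameter names, and the shape of the dictionary (query id mapped directly to query text) are all hypothetical and may not match what Pyserini ends up with.

```python
from typing import Dict, Iterator, Optional, Sequence, Tuple


class QueryIterator:
    """Hypothetical iterator yielding (id, text) pairs in a fixed order."""

    def __init__(self, topics: Dict[int, str],
                 order: Optional[Sequence[int]] = None):
        self.topics = topics
        # With no explicit order, fall back to the dictionary's own order.
        self.order = list(order) if order is not None else list(topics)

    def __iter__(self) -> Iterator[Tuple[int, str]]:
        for qid in self.order:
            yield qid, self.topics[qid]


# Made-up queries; the ids in `order` stand in for the original file order.
topics = {2: "second query", 1: "first query", 3: "third query"}
for qid, text in QueryIterator(topics, order=[1, 2, 3]):
    print(qid, text)
```

Keeping the order array inside the iterator means the search loop never needs to know where the order came from.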
Thoughts @MXueguang? I was thinking you could work on this?
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (15 by maintainers)
Top GitHub Comments
I see. So the query iterator takes cur_dict and order_array and yields (id, text) pairs.
Closed by #309