Integration between Anserini and MatchZoo
The basic idea is to transform the run.* file into the mz format required by MatchZoo. The goal is to run the ranking and reranking models from a single shell script. I hope they are all written in Python.
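As a rough illustration of that conversion, here is a minimal sketch that labels the candidates in a run file using the qrels and emits one "label qid docid" triple per line. The run layout (qid Q0 docid rank score tag) and qrels layout (qid iteration docid relevance) are the standard TREC ones; the "label qid docid" output layout is my assumption about what the MatchZoo relation file expects, and the function names and the top_k parameter are hypothetical. Treating documents that are missing from the qrels as non-relevant is also an assumption.

```python
from collections import defaultdict


def read_qrels(lines):
    """Parse TREC qrels lines of the form 'qid iteration docid relevance'."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        qrels[qid][docid] = int(rel)
    return qrels


def run_to_relations(run_lines, qrels, top_k=1000):
    """Turn TREC run lines ('qid Q0 docid rank score tag') into
    (label, qid, docid) triples.

    Assumption: a document that is not listed in the qrels is
    treated as non-relevant (label 0).
    """
    relations = []
    for line in run_lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, docid, rank, _score, _tag = parts
        if int(rank) > top_k:
            continue  # keep only the top_k candidates per query
        label = 1 if qrels.get(qid, {}).get(docid, 0) > 0 else 0
        relations.append((label, qid, docid))
    return relations


if __name__ == "__main__":
    qrels = read_qrels(["301 0 D1 1", "301 0 D2 0"])
    run = ["301 Q0 D1 1 9.5 anserini", "301 Q0 D3 2 8.1 anserini"]
    for label, qid, docid in run_to_relations(run, qrels):
        print(label, qid, docid)
```

In practice the two line lists would come from open("run.xxx") and open("qrels.xxx"), and the printed triples would be written to the relation file.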
There are several points I am not clear about:
- How to get the raw documents from the document ID in the run.* file? One option is pyjnius + Anserini, such as:
import os
# The Anserini jar must be on the classpath before jnius is imported.
os.environ['CLASSPATH'] = "/home/larumuga/Anserini/target/anserini-0.0.1-SNAPSHOT.jar"
from jnius import autoclass
JString = autoclass('java.lang.String')
index_test = autoclass('io.anserini.index.IndexUtils')
# Open the Lucene index, then look up a raw document by its ID.
indexes = index_test(JString('/home/w85yang/Anserini/lucene-index-all.car18'))
print(indexes.getRawDocument(JString('7250e1b901bb59853deb38a452f9009999e790ae')))
- How to define the train/test sets? Most TREC tracks have no train/test split, since they are IR tasks rather than ML tasks. My plan is to let users define their own train/test split. For example, we can train on MB11 and test on MB13, or train on Robust04 and test on CORE17.
- How to select negative samples? Some tracks, like CORE17, provide negative samples, but others, like CAR17, provide only the relevant documents and treat all other documents as irrelevant, so we need to sample the negatives ourselves. Can we use the query-doc pairs from the run.* file instead of the qrels.* file as the training and test data for reranking? In CAR17 I did both.
- How to select sentences? For most tracks, the documents are long texts, which makes neural network reranking very inefficient. One basic approach is to select a few representative sentences from each document by their tf-idf matching score.
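To make the sentence selection idea concrete, here is a minimal sketch that keeps the k sentences of a document with the highest tf-idf overlap with the query. Everything in it is an assumption for illustration: the function names, the naive regex sentence splitter, the choice to compute document frequency over the sentences of the single document rather than over the whole collection, and the log(1 + n/df) idf variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_sentences(query, document, k=3):
    """Return the k sentences of `document` with the highest tf-idf
    overlap with `query`, kept in their original order."""
    # Naive splitter; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency is computed over the sentences of this one document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    def score(tokens):
        tf = Counter(tokens)
        return sum(tf[t] * math.log(1 + n / df[t]) for t in query_terms if t in tf)

    # Pick the k best-scoring sentence indices, then restore document order.
    best = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    doc = (
        "Neural ranking models are slow. The weather is nice today. "
        "Ranking long documents needs sentence selection. Cats sleep a lot."
    )
    for sentence in select_sentences("neural ranking of long documents", doc, k=2):
        print(sentence)
```

The selected sentences can then be concatenated and fed to the reranker in place of the full document text.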
Top GitHub Comments
Hi, @searchivarius. Thanks for your interest in this work.
This integration is between Anserini and MatchZoo v1.0, so you can find the script in MatchZoo v1.0 here: https://github.com/NTMC-Community/MatchZoo/tree/1.0
I would suggest using my repo, since it has some updates on top of the code above that let us apply MatchZoo to the Robust04 and Tweet datasets: https://github.com/Victor0118/MatchZoo/tree/rerank/data/robust04 and https://github.com/Victor0118/MatchZoo/tree/rerank/data/tweets
Hi @arjenpdevries!
BTW, http://desires.dei.unipd.it/papers/paper10.pdf reports some nice numbers for Robust04.