Integration between Anserini and MatchZoo

The basic idea is to transform the run.* file into the mz format required by MatchZoo. The goal is to run both the ranking and the reranking models from a single shell script, ideally with everything written in Python. There are several points I am not clear about:

  1. How to get the raw documents from the document IDs in the run.* file? One option is pyjnius + Anserini, for example:
import os
# The Anserini jar must be on the classpath before jnius is imported.
os.environ['CLASSPATH'] = "/home/larumuga/Anserini/target/anserini-0.0.1-SNAPSHOT.jar"

from jnius import autoclass
JString = autoclass('java.lang.String')

# Open the Lucene index through Anserini's IndexUtils class.
IndexUtils = autoclass('io.anserini.index.IndexUtils')
index_utils = IndexUtils(JString('/home/w85yang/Anserini/lucene-index-all.car18'))

# Look up the raw document text by its document ID.
print(index_utils.getRawDocument(JString('7250e1b901bb59853deb38a452f9009999e790ae')))
  2. How to define the train/test sets? Most TREC tracks have no train/test split, since they are IR tasks rather than ML tasks. My plan is to let users define their own split. For example, we can train on MB11 and test on MB13, or train on Robust04 and test on CORE17.

  3. How to select negative samples? Some tracks, like CORE17, provide negative samples, but others, like CAR17, only provide the relevant documents and treat all other documents as irrelevant, so we need to sample negatives ourselves. Can we use the query-doc pairs from the run.* file instead of the qrels.* file as the training and test data for reranking? For CAR17 I did both.

  4. How to select sentences? For most tracks the documents are long texts, which causes a big efficiency problem for neural network reranking. One basic approach is to select a few representative sentences from each document by tf-idf matching score.
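
The negative-sampling and sentence-selection ideas above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the approach actually used in the Anserini or MatchZoo code. It assumes the standard TREC run layout (`qid Q0 docid rank score tag`) and qrels layout (`qid iteration docid relevance`), treats every retrieved but unjudged document as a negative, and scores sentences with a simple tf-idf overlap against the query. All function names are hypothetical, and the `(label, qid, docid)` output is only an assumed intermediate form; the exact mz file format should be checked against the MatchZoo data preparation scripts.

```python
import math
import re
from collections import Counter


def parse_run(lines):
    """Parse TREC run lines ('qid Q0 docid rank score tag') into {qid: [docid, ...]}."""
    ranking = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, docid = parts[0], parts[2]
        ranking.setdefault(qid, []).append(docid)
    return ranking


def parse_qrels(lines):
    """Parse TREC qrels lines ('qid iteration docid relevance') into {(qid, docid): rel}."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        qrels[(qid, docid)] = int(rel)
    return qrels


def label_pairs(ranking, qrels):
    """Label each retrieved pair; unjudged documents are assumed to be negatives."""
    return [
        (1 if qrels.get((qid, docid), 0) > 0 else 0, qid, docid)
        for qid, docids in ranking.items()
        for docid in docids
    ]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def select_sentences(query, document, k=2):
    """Keep the k sentences with the highest tf-idf overlap with the query, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency is computed over the sentences of this one document.
    df = Counter(term for toks in tokens for term in set(toks))
    q_terms = set(tokenize(query))

    def score(toks):
        tf = Counter(toks)
        return sum(tf[t] * math.log(1 + n / df[t]) for t in q_terms if t in tf)

    scores = [score(toks) for toks in tokens]
    top = sorted(range(n), key=lambda i: -scores[i])[:k]
    return " ".join(sentences[i] for i in sorted(top))


if __name__ == "__main__":
    run = ["301 Q0 D1 1 12.5 bm25", "301 Q0 D2 2 11.0 bm25"]
    qrels = ["301 0 D1 1"]
    for label, qid, docid in label_pairs(parse_run(run), parse_qrels(qrels)):
        print(label, qid, docid)
```

Taking negatives from the run file means the reranker trains on the same kind of candidates it will see at test time, whereas sampling from the whole collection tends to produce much easier negatives. Which of the two works better is an empirical question this sketch does not answer.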

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

Victor0118 commented, Jun 26, 2019 (1 reaction)

Hi, @searchivarius. Thanks for your interest in this work.

This integration is between Anserini and MatchZoo v1.0, so you can find the script in the MatchZoo v1.0 branch here: https://github.com/NTMC-Community/MatchZoo/tree/1.0

I would suggest using my repo instead, since I have made some updates on top of that code so that MatchZoo can be applied to the Robust04 and Tweet datasets: https://github.com/Victor0118/MatchZoo/tree/rerank/data/robust04 and https://github.com/Victor0118/MatchZoo/tree/rerank/data/tweets

lintool commented, Oct 20, 2018 (1 reaction)

Hi @arjenpdevries!

BTW, http://desires.dei.unipd.it/papers/paper10.pdf reports some nice numbers for Robust04.
