sentence-transformers for large-scale search
Hi, thanks for this fantastic repo and its documentation!
I have a question: I am working on a research project on fact verification in Czech, and as a first step we are trying various approaches to document retrieval. Our corpus is Czech Wikipedia abstracts, and we have a dataset of claim-Wikipedia ID pairs.
I’ve split my Wikipedia abstracts into sentences and have been trying to use sentence-transformers to get meaningful embeddings and do top-k search in the embedding space. I’ve experimented with mBERT embeddings, which gave me pretty underwhelming results (around 3 times worse than BM25).
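For reference, a minimal sketch of this kind of top-k embedding search using `util.semantic_search` from sentence-transformers; the model name and the tiny corpus are placeholders, not the setup actually used above:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model and data -- substitute the multilingual model and corpus actually used
model = SentenceTransformer("bert-base-multilingual-cased")

corpus_sentences = ["Praha je hlavní město České republiky.", "Brno je druhé největší město."]
claims = ["Praha je hlavním městem Česka."]

corpus_embeddings = model.encode(corpus_sentences, convert_to_tensor=True)
claim_embeddings = model.encode(claims, convert_to_tensor=True)

# Top-k search in the embedding space (cosine similarity by default)
hits = util.semantic_search(claim_embeddings, corpus_embeddings, top_k=10)
for claim, claim_hits in zip(claims, hits):
    for hit in claim_hits:
        print(claim, "->", corpus_sentences[hit["corpus_id"]], hit["score"])
```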
I tried training an xlm-roberta with the make_multilingual.py script on Czech TED parallel data (I also tried OPUS, with no real gains), but this performed worse than base mBERT with a mean pooling layer.
The metric I use is a modified precision@k, such that each claim has 1 wiki_id and k = 10. (I’ve tried running mBERT for k up to 50, which increased precision from 0.09@k=10 to 0.16@k=50; BM25 without much pre-processing nets 0.3@k=10.)
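For concreteness, one way this metric could be computed, interpreting it as the fraction of claims whose single gold Wikipedia ID appears among the top-k retrieved documents (an assumption, since the exact definition isn’t spelled out above; the data structures are hypothetical):

```python
def hit_rate_at_k(retrieved_ids_per_claim, gold_id_per_claim, k=10):
    """Fraction of claims whose gold wiki_id appears in the top-k retrieved ids."""
    hits = sum(
        1 for retrieved, gold in zip(retrieved_ids_per_claim, gold_id_per_claim)
        if gold in retrieved[:k]
    )
    return hits / len(gold_id_per_claim)

# Example: the gold page is found for the first claim only -> 0.5
print(hit_rate_at_k([["praha", "brno"], ["ostrava"]], ["praha", "plzen"], k=10))
```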
- Does it make sense that xlm-roberta after teacher-student training performs worse than mBERT without any such training?
- Do you suppose extracting embeddings from transformers can work for large-scale IR, or would I need to get more creative with the pretraining tasks, e.g. https://arxiv.org/pdf/2002.03932.pdf ?
Appreciate any response!
Hi @MichalPitr I think the current models are not the best / not suited for this task. They were trained on the sentence level; however, for retrieval, you usually want to index paragraphs.
Further, the current models are rather “recall” oriented, i.e., they have a low chance of missing something. For IR, you usually want precision-oriented models, like BM25.
We currently plan to release several examples (+ pre-trained models) for information retrieval soon. The models we have so far already beat BM25 by quite a margin (on English datasets like MS MARCO). However, they can still be made better 👍
If you have suitable training data in the format (query, relevant_passage), I can recommend having a look at this: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py
This gave us the best performance so far when we trained models for IR.
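A minimal sketch of training with that loss, assuming (query, relevant_passage) pairs are available; the base model, batch size, and epoch count are placeholders rather than settings recommended in this thread:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, relevant_passage) training pairs -- replace with real data
train_examples = [
    InputExample(texts=["claim or query 1", "relevant passage 1"]),
    InputExample(texts=["claim or query 2", "relevant passage 2"]),
]

model = SentenceTransformer("bert-base-multilingual-cased")  # placeholder base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# In-batch negatives: every other passage in the batch serves as a negative,
# so larger batch sizes generally help.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```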
Current timeline:
Best Nils Reimers
Hi @MichalPitr Not yet.
In this paper: https://arxiv.org/abs/2002.03932
They had quite good results with the Inverse Cloze Task, much better than with masked language modeling.
So I think it will be worthwhile to test this, especially as the implementation is straightforward.
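As an illustration only (not taken from that paper’s code), a rough sketch of how Inverse Cloze Task training pairs could be constructed from paragraphs: a random sentence serves as the pseudo-query and the rest of the paragraph as its positive passage, and the resulting pairs could then be fed to a loss like MultipleNegativesRankingLoss:

```python
import random
from sentence_transformers import InputExample

def build_ict_examples(paragraphs, remove_prob=0.9, seed=42):
    """Build Inverse Cloze Task pairs: a sentence as pseudo-query,
    the rest of its paragraph as the positive passage.
    With probability (1 - remove_prob) the sentence is kept in the passage,
    following the ICT setup described in the literature."""
    rng = random.Random(seed)
    examples = []
    for sentences in paragraphs:  # each paragraph is a list of sentences
        if len(sentences) < 2:
            continue
        idx = rng.randrange(len(sentences))
        query = sentences[idx]
        if rng.random() < remove_prob:
            passage = " ".join(s for i, s in enumerate(sentences) if i != idx)
        else:
            passage = " ".join(sentences)
        examples.append(InputExample(texts=[query, passage]))
    return examples
```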
In September, a new Ph.D. student will join my team who will be working on this (and related topics). We hope that we can then publish better pre-training strategies.
Best Nils Reimers