sentence-transformers for large-scale search
Hi, thanks for this fantastic repo and its documentation!
I have a question: I am working on a research project on fact verification in Czech, and as a first step we are trying various approaches to document retrieval. Our corpus is Czech Wikipedia abstracts, and we have a dataset of claim-Wikipedia ID pairs.
I’ve split my Wikipedia abstracts into sentences and have been trying to use sentence-transformers to get meaningful embeddings and do top-k search in the embedding space. I’ve experimented with mBERT embeddings, which gave me pretty underwhelming results (around 3 times worse than BM25).
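For reference, a minimal sketch of this kind of top-k embedding search using `util.semantic_search` from sentence-transformers; the model name and the tiny corpus are placeholders, not the setup actually used above:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model and data -- substitute the multilingual model and corpus actually used
model = SentenceTransformer("bert-base-multilingual-cased")

corpus_sentences = ["Praha je hlavní město České republiky.", "Brno je druhé největší město."]
claims = ["Praha je hlavním městem Česka."]

corpus_embeddings = model.encode(corpus_sentences, convert_to_tensor=True)
claim_embeddings = model.encode(claims, convert_to_tensor=True)

# Top-k search in the embedding space (cosine similarity by default)
hits = util.semantic_search(claim_embeddings, corpus_embeddings, top_k=10)
for claim, claim_hits in zip(claims, hits):
    for hit in claim_hits:
        print(claim, "->", corpus_sentences[hit["corpus_id"]], hit["score"])
```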
I tried training an xlm-roberta with the make_multilingual.py script on Czech TED parallel data (I also tried OPUS, with no real gains), but this performed worse than base mBERT with a mean pooling layer.
The metric I use is a modified precision@k, such that each claim has 1 wiki_id and k = 10. (I’ve tried running mBERT for k up to 50, which increased precision from 0.09@k=10 to 0.16@k=50; BM25 without much pre-processing nets 0.3@k=10.)
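For concreteness, one way this metric could be computed, interpreting it as the fraction of claims whose single gold Wikipedia ID appears among the top-k retrieved documents (an assumption, since the exact definition isn’t spelled out above; the data structures are hypothetical):

```python
def hit_rate_at_k(retrieved_ids_per_claim, gold_id_per_claim, k=10):
    """Fraction of claims whose gold wiki_id appears in the top-k retrieved ids."""
    hits = sum(
        1 for retrieved, gold in zip(retrieved_ids_per_claim, gold_id_per_claim)
        if gold in retrieved[:k]
    )
    return hits / len(gold_id_per_claim)

# Example: the gold page is found for the first claim only -> 0.5
print(hit_rate_at_k([["praha", "brno"], ["ostrava"]], ["praha", "plzen"], k=10))
```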
- Does it make sense that xlm-roberta after teacher-student training performs worse than mBERT without any such training?
- Do you suppose extracting embeddings from transformers can work for large-scale IR, or would I need to get more creative with the pretraining tasks, e.g. https://arxiv.org/pdf/2002.03932.pdf ?
Appreciate any response!
Hi @MichalPitr I think the current models are not the best / not suited for this task. They were trained on the sentence level; however, for retrieval, you usually want to index paragraphs.
Further, the current models are rather “recall” oriented, i.e., they have a low chance of missing something. For IR, you usually want precision-oriented models, like BM25.
We currently plan to release several examples (+ pre-trained models) for information retrieval soon. The models we have so far already beat BM25 by quite a margin (on English datasets like MS MARCO). However, they can still be made better 👍
If you have suitable training data in the format (query, relevant_passage), I can recommend having a look at this: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py
This gave us the best performance so far when we trained models for IR.
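A minimal sketch of training with that loss, assuming (query, relevant_passage) pairs are available; the base model, batch size, and epoch count are placeholders rather than settings recommended in this thread:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, relevant_passage) training pairs -- replace with real data
train_examples = [
    InputExample(texts=["claim or query 1", "relevant passage 1"]),
    InputExample(texts=["claim or query 2", "relevant passage 2"]),
]

model = SentenceTransformer("bert-base-multilingual-cased")  # placeholder base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

# In-batch negatives: every other passage in the batch serves as a negative,
# so larger batch sizes generally help.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```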
Current timeline:
Best Nils Reimers
Hi @MichalPitr Not yet.
In this paper: https://arxiv.org/abs/2002.03932
They had quite good results with the Inverse Cloze Task, much better than with masked language modeling.
So I think it will be worthwhile to test this, especially as the implementation is straightforward.
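As an illustration only (not taken from that paper’s code), a rough sketch of how Inverse Cloze Task training pairs could be constructed from paragraphs: a random sentence serves as the pseudo-query and the rest of the paragraph as its positive passage, and the resulting pairs could then be fed to a loss like MultipleNegativesRankingLoss:

```python
import random
from sentence_transformers import InputExample

def build_ict_examples(paragraphs, remove_prob=0.9, seed=42):
    """Build Inverse Cloze Task pairs: a sentence as pseudo-query,
    the rest of its paragraph as the positive passage.
    With probability (1 - remove_prob) the sentence is kept in the passage,
    following the ICT setup described in the literature."""
    rng = random.Random(seed)
    examples = []
    for sentences in paragraphs:  # each paragraph is a list of sentences
        if len(sentences) < 2:
            continue
        idx = rng.randrange(len(sentences))
        query = sentences[idx]
        if rng.random() < remove_prob:
            passage = " ".join(s for i, s in enumerate(sentences) if i != idx)
        else:
            passage = " ".join(sentences)
        examples.append(InputExample(texts=[query, passage]))
    return examples
```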
In September, a new Ph.D. student will join my team who will be working on this (and related topics). We hope that we can then publish better pre-training strategies.
Best Nils Reimers