
sentence-transformers for large-scale search

See original GitHub issue

Hi, thanks for this fantastic repo and its documentation!

I have a question: I am working on a research project on fact-verification in Czech and as the first step we are trying various approaches to document retrieval. Our corpus is Czech Wikipedia abstracts and we have a dataset of claim-Wikipedia ID pairs.

I’ve split my Wikipedia abstracts into sentences and have been trying to use sentence-transformers to get meaningful embeddings and do top-k search in the embedding space. I’ve experimented with mBERT embeddings, which gave me pretty underwhelming results (around 3 times worse than BM25). I also tried training an XLM-RoBERTa with the make_multilingual.py script on Czech TED parallel data (and on OPUS, with no real gains), but this performed worse than base mBERT with a mean pooling layer.
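
For concreteness, this is roughly the retrieval setup (a minimal sketch with placeholder names; `sentences` and `claims` are assumed to be loaded elsewhere, and the exact model construction is an assumption):

```python
from sentence_transformers import SentenceTransformer, models, util

# mBERT encoder with a mean pooling layer, as described above
word_emb = models.Transformer('bert-base-multilingual-cased', max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())  # mean pooling by default
model = SentenceTransformer(modules=[word_emb, pooling])

# sentences: all Wikipedia abstract sentences; claims: the claims to verify (both assumed loaded)
corpus_embeddings = model.encode(sentences, convert_to_tensor=True, show_progress_bar=True)
claim_embeddings = model.encode(claims, convert_to_tensor=True)

# Cosine-similarity top-k search in the embedding space
hits = util.semantic_search(claim_embeddings, corpus_embeddings, top_k=10)
```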

The metric I use is a modified precision@k, such that each claim has 1 wiki_id and k = 10. (I’ve tried running mBERT for k up to 50, which increased precision from 0.09@k=10 to 0.16@k=50; BM25 without much pre-processing nets 0.3@k=10.)
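
(If it helps, this is roughly how the metric could be computed; my reading is the fraction of claims whose gold Wikipedia ID appears among the top-k retrieved sentences, and the variable names below are placeholders building on the sketch above.)

```python
# hits: output of util.semantic_search above (one list of top-k results per claim)
# sentence_wiki_ids: Wikipedia page ID for each corpus sentence, aligned with `sentences`
# gold_wiki_ids: the single gold Wikipedia page ID for each claim
def precision_at_k(hits, sentence_wiki_ids, gold_wiki_ids):
    correct = 0
    for claim_hits, gold_id in zip(hits, gold_wiki_ids):
        retrieved = {sentence_wiki_ids[h['corpus_id']] for h in claim_hits}
        if gold_id in retrieved:
            correct += 1
    return correct / len(gold_wiki_ids)
```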

  1. Does it make sense for xlm-roberta after teacher-student training to perform worse than mBERT without any?
  2. Do you suppose extracting embeddings from transformers can work for large-scale IR, or would I need to get more creative with the pretraining tasks, e.g. https://arxiv.org/pdf/2002.03932.pdf ?

Appreciate any response!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

10 reactions
nreimers commented, Jul 30, 2020

Hi @MichalPitr, I think the current models are not the best fit / not suited for this task. They were trained on the sentence level; for retrieval, however, you usually want to index paragraphs.

Further, the current models are rather “recall”-oriented, i.e., they have a low chance of missing something. For IR, you usually want precision-oriented models, like BM25.

We currently plan to release several examples (plus pre-trained models) for information retrieval soon. The models we have so far already beat BM25 by quite a margin (on English, on datasets like MS MARCO). However, they can still be made better 👍

If you have suitable training data in the format (query, relevant_passage), I can recommend having a look at this: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py

So far, this has given us the best performance when training models for IR.
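
A minimal training sketch with that loss might look like the following (the base model, hyperparameters, and `train_pairs` are placeholders, not a recommended setup):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Placeholder choice: a multilingual encoder with mean pooling
word_emb = models.Transformer('xlm-roberta-base', max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# train_pairs: list of (query, relevant_passage) tuples, assumed to exist
train_examples = [InputExample(texts=[query, passage]) for query, passage in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: every other passage in the batch acts as a negative for a query
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path='output/retrieval-model',
)
```

Larger batch sizes generally help with this loss, since each query then sees more in-batch negatives.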

Current timeline:

  • I will soon release code examples for information retrieval on the sentence level, using the Quora duplicate questions dataset (given a sentence, find duplicate sentences in a collection of 400k sentences)
  • More complex training procedures for passage retrieval, trained on the MS MARCO and Natural Questions datasets, will take more time until we have everything set up and can release code + models.

Best Nils Reimers

5 reactions
nreimers commented, Jul 30, 2020

Hi @MichalPitr, not yet.

In this paper: https://arxiv.org/abs/2002.03932

They obtained quite good results with the Inverse Cloze Task, much better than with masked language modeling.

So I think it will be worthwhile to test this, especially as the implementation is straightforward.
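
A rough sketch of how such ICT pairs could be constructed (a generic variant, not necessarily the exact setup of the paper above; `paragraphs` is assumed to be a list of sentence-split paragraphs):

```python
import random

def make_ict_pairs(paragraphs, keep_query_prob=0.1, seed=42):
    """Build Inverse Cloze Task (query, passage) pairs.

    A random sentence from each paragraph becomes the pseudo-query and the rest
    of the paragraph becomes the pseudo-relevant passage. Occasionally the query
    sentence is kept in the passage so the model also sees exact lexical overlap.
    """
    rng = random.Random(seed)
    pairs = []
    for sentences in paragraphs:
        if len(sentences) < 2:
            continue
        query_idx = rng.randrange(len(sentences))
        query = sentences[query_idx]
        if rng.random() < keep_query_prob:
            passage = sentences  # keep the query sentence in the passage
        else:
            passage = sentences[:query_idx] + sentences[query_idx + 1:]
        pairs.append((query, ' '.join(passage)))
    return pairs
```

The resulting pairs could then be fed into the same MultipleNegativesRankingLoss setup sketched above.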

In September, a new Ph.D. student who will be working on this (and related topics) will join my team. We hope that we can then publish better pre-training strategies.

Best Nils Reimers

Read more comments on GitHub >

Top Results From Across the Web

Semantic Search — Sentence-Transformers documentation
The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector...

sentence-transformers/multi-qa-MiniLM-L6-cos-v1
This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for semantic search....

How to choose a Sentence Transformer from Hugging Face
Models such as DALL-E, which generates unbelievable images from text prompts, or CLIP, that searches through massive scales of images with ...

How to Build a Semantic Search Engine With Transformers ...
... a vector-based search engine with sentence transformers and Faiss. ... however, it can be very slow with a large dataset as it...

sentence-transformers/fast_clustering.py at master - GitHub
This is a more complex example on performing clustering on large scale dataset. This examples find in a large set of sentences local...
