
Reproducing msmarco-distilbert-dot-v5 training

See original GitHub issue

Hey there,

My team and I have been really impressed by the latest results of the msmarco-distilbert-dot-v5 model (HF card available here) on the MS MARCO passage dev set; it’s quite astonishing!

I’ve been able to use the model for inference and obtained an MRR@10 similar to yours 😄 and the next step for me is to reproduce the training of that model so that I can replicate it with a different training set.
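For context, the inference side is just the standard sentence-transformers pattern, roughly like this (the query/passage strings are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-dot-v5")

query_emb = model.encode("what is python")
passage_embs = model.encode([
    "Python is a high-level programming language.",
    "A python is a large constricting snake.",
])

# The dot-v5 models are trained for dot-product similarity, not cosine.
scores = util.dot_score(query_emb, passage_embs)
print(scores)
```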

Following the script given in the HF model card here, I’ve stumbled upon two issues:

  1. The file msmarco-hard-negatives-v6.jsonl.gz seems to have been a local file that I can’t find in the HF datasets. The closest I found was msmarco-hard-negatives, which comprises a scores file (cross-encoder scores from a MiniLM-L-6-v2-based model for a set of (qid, pid) pairs, as a Dict[int, Dict[int, float]]) and a mined-negatives file (as a Dict[int, Dict[str, List[int]]]). From what I’ve figured out, the training script expects an intermediary mined-negatives file that carries the cross-encoder scores in the same structure, i.e. a combination of the two aforementioned files. So I manually combined both to be able to run the script, roughly as sketched after this list (you can also tell me if I was misguided at this step, but it seems OK to me, so this point is basically resolved). => ✔️
  2. The script was apparently launched (cf. the last line of the script) with the argument --model final-models/distilbert-margin_mse-sym_mnrl-mean-v1. If I understand correctly, this means the script starts from a pretrained DistilBERT model, but I can’t find it on the Model Hub or anywhere else. Is there any way for you to tell me how to get it? => ❌ 😢
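For reference, here is a minimal sketch of the merge mentioned in point 1; the file names and the exact output schema are my assumptions, not anything official:

```python
import gzip
import json
import pickle

# Assumed local file names -- adjust to whatever you downloaded from the
# msmarco-hard-negatives dataset page.
SCORES_FILE = "cross-encoder-scores.pkl.gz"         # Dict[qid, Dict[pid, float]]
NEGATIVES_FILE = "msmarco-hard-negatives.jsonl.gz"  # one JSON object per query
OUTPUT_FILE = "msmarco-hard-negatives-with-scores.jsonl.gz"

with gzip.open(SCORES_FILE, "rb") as f:
    ce_scores = pickle.load(f)  # {qid: {pid: cross-encoder score}}

with gzip.open(NEGATIVES_FILE, "rt") as f_in, gzip.open(OUTPUT_FILE, "wt") as f_out:
    for line in f_in:
        entry = json.loads(line)  # {"qid": ..., "neg": {system: [pid, ...]}, ...}
        scores = ce_scores.get(entry["qid"], {})
        # Attach the cross-encoder score to every mined negative, keeping the
        # per-system structure of the mined-negatives file intact.
        entry["neg"] = {
            system: [(pid, scores[pid]) for pid in pids if pid in scores]
            for system, pids in entry["neg"].items()
        }
        if "pos" in entry:  # score the positives too, if present in your copy
            entry["pos"] = [(pid, scores[pid]) for pid in entry["pos"] if pid in scores]
        f_out.write(json.dumps(entry) + "\n")
```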

Thank you in advance for your precious help,

Peace ☮️ 🤙

PS: also thank you for the amazing work you and your team have done on this library and congratulations on all the research results you’ve obtained so far.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
nreimers commented, Jan 6, 2022

No, you can just launch it with the default parameters and the distilbert-base-uncased model.
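In sentence-transformers terms, starting from the default base checkpoint looks roughly like this (the pooling mode and sequence length are assumptions based on the v5 model cards):

```python
from sentence_transformers import SentenceTransformer, models

# Plain pretrained DistilBERT as the word-embedding layer, mean pooling on top.
word_embedding_model = models.Transformer("distilbert-base-uncased", max_seq_length=350)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```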

1 reaction
nreimers commented, Jan 6, 2022

You can find a clean and nice version of the training here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py

It will produce a model with similar performance.

Otherwise, for this specific model, training was done in two iterations:

  1. Start with the distilbert-base-uncased model and train with MarginMSE + MultipleNegativesRankingLoss (sketched right after this list).
  2. Use the model from 1) to mine hard negatives, and score them all with a cross-encoder.
  3. Continue training that model with MarginMSE loss and those specific hard negatives.
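
A minimal sketch of the two-loss setup in step 1, with toy data just to show the expected input shapes (everything besides the loss classes is illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("distilbert-base-uncased")  # mean pooling is added by default

# MarginMSE expects (query, pos, neg) triplets labeled with the cross-encoder
# score margin ce(query, pos) - ce(query, neg).
margin_mse_data = [InputExample(
    texts=["what is python", "Python is a language.", "A python is a snake."],
    label=4.2,  # illustrative margin
)]
# MNRL uses in-batch negatives; (query, positive) pairs are enough.
mnrl_data = [InputExample(texts=["what is python", "Python is a language."])]

model.fit(
    train_objectives=[
        (DataLoader(margin_mse_data, shuffle=True, batch_size=1),
         losses.MarginMSELoss(model)),
        (DataLoader(mnrl_data, shuffle=True, batch_size=1),
         losses.MultipleNegativesRankingLoss(model)),
    ],
    epochs=1,  # illustrative; the real training runs much longer
)
```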

But the script linked above will produce a model that is on par.
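
And for the scoring half of step 2, the cross-encoder whose scores ship with the hard-negatives dataset can be applied roughly like this (placeholder strings again):

```python
from sentence_transformers import CrossEncoder

# Cross-encoder used to score the mined (query, passage) pairs.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is python"
mined_negatives = ["A python is a large snake.", "Java is a programming language."]

# One relevance score per (query, passage) pair; higher means more relevant.
scores = ce.predict([(query, passage) for passage in mined_negatives])
print(scores)
```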


