
Reproducing msmarco-distilbert-dot-v5 training

See original GitHub issue

Hey there,

My team and I have been really impressed by the latest results of the msmarco-distilbert-dot-v5 model (HF card available here) on the MS MARCO passage dev set; it’s quite astonishing!

I’ve been able to use the model for inference and obtained an MRR@10 similar to yours 😄 and the next step for me is to reproduce the training of that model so that I can replicate it with a different training set.
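For context, the inference side is just the standard sentence-transformers pattern, roughly like this (the query/passage strings are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-dot-v5")

query_emb = model.encode("what is python")
passage_embs = model.encode([
    "Python is a high-level programming language.",
    "A python is a large constricting snake.",
])

# The dot-v5 models are trained for dot-product similarity, not cosine.
scores = util.dot_score(query_emb, passage_embs)
print(scores)
```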

Following the script given in the HF model card here, I’ve stumbled upon two issues:

  1. The file msmarco-hard-negatives-v6.jsonl.gz seems to have been a local file that I can’t find in the HF datasets. The closest I found was msmarco-hard-negatives, which comprises a scores file (cross-encoder scores from a MiniLM-L-6-v2-based model for a set of (qid, pid) pairs, as a Dict[int, Dict[int, float]]) and a mined-negatives file (as a Dict[int, Dict[str, List[int]]]). From what I’ve figured out, the training script expects an intermediary mined-negatives file that carries the cross-encoder scores in the same structure, i.e. a combination of the two aforementioned files. So I manually combined both to be able to run the script, roughly as sketched after this list (you can also tell me if I was misguided at this step, but it seems OK to me, so this point is basically resolved). => ✔️
  2. The script was apparently launched (cf. the last line of the script) with the argument --model final-models/distilbert-margin_mse-sym_mnrl-mean-v1. If I understand correctly, this means the script starts from a pretrained DistilBERT model, but I can’t find it on the Model Hub or anywhere else. Is there any way for you to tell me how to get it? => ❌ 😢
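For reference, here is a minimal sketch of the merge mentioned in point 1; the file names and the exact output schema are my assumptions, not anything official:

```python
import gzip
import json
import pickle

# Assumed local file names -- adjust to whatever you downloaded from the
# msmarco-hard-negatives dataset page.
SCORES_FILE = "cross-encoder-scores.pkl.gz"         # Dict[qid, Dict[pid, float]]
NEGATIVES_FILE = "msmarco-hard-negatives.jsonl.gz"  # one JSON object per query
OUTPUT_FILE = "msmarco-hard-negatives-with-scores.jsonl.gz"

with gzip.open(SCORES_FILE, "rb") as f:
    ce_scores = pickle.load(f)  # {qid: {pid: cross-encoder score}}

with gzip.open(NEGATIVES_FILE, "rt") as f_in, gzip.open(OUTPUT_FILE, "wt") as f_out:
    for line in f_in:
        entry = json.loads(line)  # {"qid": ..., "neg": {system: [pid, ...]}, ...}
        scores = ce_scores.get(entry["qid"], {})
        # Attach the cross-encoder score to every mined negative, keeping the
        # per-system structure of the mined-negatives file intact.
        entry["neg"] = {
            system: [(pid, scores[pid]) for pid in pids if pid in scores]
            for system, pids in entry["neg"].items()
        }
        if "pos" in entry:  # score the positives too, if present in your copy
            entry["pos"] = [(pid, scores[pid]) for pid in entry["pos"] if pid in scores]
        f_out.write(json.dumps(entry) + "\n")
```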

Thank you in advance for your precious help,

Peace ☮️ 🤙

PS: also thank you for the amazing work you and your team have done on this library and congratulations on all the research results you’ve obtained so far.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
nreimers commented, Jan 6, 2022

No, you can just launch it with the default parameters and the distilbert-base-uncased model.
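In sentence-transformers terms, starting from the default base checkpoint looks roughly like this (the pooling mode and sequence length are assumptions based on the v5 model cards):

```python
from sentence_transformers import SentenceTransformer, models

# Plain pretrained DistilBERT as the word-embedding layer, mean pooling on top.
word_embedding_model = models.Transformer("distilbert-base-uncased", max_seq_length=350)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```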

1 reaction
nreimers commented, Jan 6, 2022

You can find a clean and nice version of the training here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py

It will produce a model with similar performance.

Otherwise, for this specific model, training was done in two iterations:

  1. Start with the distilbert-base-uncased model and train with MarginMSE + MultipleNegativesRankingLoss (sketched right after this list).
  2. Use the model from 1) to mine hard negatives, and score them all with a cross-encoder.
  3. Continue training that model with MarginMSE loss and those specific hard negatives.
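
A minimal sketch of the two-loss setup in step 1, with toy data just to show the expected input shapes (everything besides the loss classes is illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("distilbert-base-uncased")  # mean pooling is added by default

# MarginMSE expects (query, pos, neg) triplets labeled with the cross-encoder
# score margin ce(query, pos) - ce(query, neg).
margin_mse_data = [InputExample(
    texts=["what is python", "Python is a language.", "A python is a snake."],
    label=4.2,  # illustrative margin
)]
# MNRL uses in-batch negatives; (query, positive) pairs are enough.
mnrl_data = [InputExample(texts=["what is python", "Python is a language."])]

model.fit(
    train_objectives=[
        (DataLoader(margin_mse_data, shuffle=True, batch_size=1),
         losses.MarginMSELoss(model)),
        (DataLoader(mnrl_data, shuffle=True, batch_size=1),
         losses.MultipleNegativesRankingLoss(model)),
    ],
    epochs=1,  # illustrative; the real training runs much longer
)
```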

But the script linked above will produce a model that is on par.
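
And for the scoring half of step 2, the cross-encoder whose scores ship with the hard-negatives dataset can be applied roughly like this (placeholder strings again):

```python
from sentence_transformers import CrossEncoder

# Cross-encoder used to score the mined (query, passage) pairs.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is python"
mined_negatives = ["A python is a large snake.", "Java is a programming language."]

# One relevance score per (query, passage) pair; higher means more relevant.
scores = ce.predict([(query, passage) for passage in mined_negatives])
print(scores)
```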


