Reproducing msmarco-distilbert-dot-v5 training
Hey there,
My team and I have been really amazed by the latest results of the msmarco-distilbert-dot-v5 model (HF card available here) on the MS MARCO passage dev set; it's quite astonishing!
I’ve been able to use the model for inference and obtained an MRR@10 similar to yours 😄 and the next step for me is to reproduce the training of that model in order to replicate it with a different training set.
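For what it's worth, scoring passages with the released model is straightforward; a minimal sketch of the kind of setup I used looks roughly like this (example texts are made up, and dot-product similarity matches the "dot" in the model name):

```python
# Minimal sketch: score passages against a query with the released model.
# msmarco-distilbert-dot-v5 was trained for dot-product similarity,
# hence util.dot_score rather than cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-dot-v5")

query_emb = model.encode("how many calories in an egg")
passage_embs = model.encode([
    "A large egg contains about 72 calories.",
    "Eggs are laid by female animals of many species.",
])

print(util.dot_score(query_emb, passage_embs))  # higher score = more relevant
```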
Following the script given in the HF model card here, I've stumbled upon two issues:
- The file `msmarco-hard-negatives-v6.jsonl.gz` seems to have been a local file that I can't find exactly in the HF datasets. The closest I found was `msmarco-hard-negatives`, which comprises a scores file (scores from a MiniLM-L6-v2-based cross-encoder for a set of (qid, pid) pairs, as a `Dict[int, Dict[int, float]]`) and a mined-negatives file (as a `Dict[int, Dict[str, List[int]]]`). From what I've figured, the training script takes an intermediary mined-negatives file that carries the cross-encoder scores in the same structure, which would be a combination of both aforementioned files, so I manually combined the two to be able to run the script; see the sketch after this list. (You can also tell me if I was misguided at this step, but it seems OK to me and this point is basically resolved.) => ✔️
- The script was apparently launched (cf. the last line of the script) with the argument `--model final-models/distilbert-margin_mse-sym_mnrl-mean-v1`, and if I understand correctly, this means the script starts from a pre-trained DistilBERT model. I can't seem to find it on the model Hub or anywhere else; is there any way for you to tell me how to get it? => ❌ 😢
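For reference, here is roughly what my combining step looked like. The file names and the merged JSON layout are my own guesses based on the two structures above, not the exact format of the original `msmarco-hard-negatives-v6.jsonl.gz`:

```python
import gzip
import json
import pickle

# Assumed inputs (names are illustrative):
#  - a pickled Dict[int, Dict[int, float]]: qid -> {pid: cross-encoder score}
#  - a jsonl.gz of mined negatives, one object per line, e.g.
#    {"qid": 1000, "pos": [pid, ...], "neg": {"bm25": [pid, ...], ...}}
with gzip.open("cross-encoder-scores.pkl.gz", "rb") as f:
    ce_scores = pickle.load(f)

with gzip.open("msmarco-hard-negatives.jsonl.gz", "rt") as fin, \
     gzip.open("msmarco-hard-negatives-with-scores.jsonl.gz", "wt") as fout:
    for line in fin:
        entry = json.loads(line)
        scores = ce_scores.get(entry["qid"], {})
        # Attach the cross-encoder score to every positive and mined negative,
        # dropping pids that have no score.
        entry["pos"] = [{"pid": pid, "ce-score": scores[pid]}
                        for pid in entry["pos"] if pid in scores]
        entry["neg"] = {system: [{"pid": pid, "ce-score": scores[pid]}
                                 for pid in pids if pid in scores]
                        for system, pids in entry["neg"].items()}
        fout.write(json.dumps(entry) + "\n")
```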
Thank you in advance for your precious help,
Peace ☮️ 🤙
PS: also thank you for the amazing work you and your team have done on this library and congratulations on all the research results you’ve obtained so far.
No, you can just launch it with the default parameters and the distilbert-base-uncased model.
You can find a clean and nice version of the training here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py
It will produce a model with similar performance.
Otherwise, for this specific model, training was done in two iterations, but the above-linked script will produce a model that is on par.
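As a sketch of what "default parameters with distilbert-base-uncased" amounts to, the core model/loss setup in that script looks roughly like this (mean pooling, matching the "-mean-" in the checkpoint name mentioned above; data loading and hyperparameters live in the linked script, so check its argparse section for the exact defaults):

```python
# Rough sketch of the bi-encoder setup: start from plain
# distilbert-base-uncased, add mean pooling, train with MarginMSE loss.
# Data loading (triplets labeled with cross-encoder score margins) is
# omitted; see train_bi-encoder_margin-mse.py for the full pipeline.
from sentence_transformers import SentenceTransformer, models, losses

word_embedding_model = models.Transformer("distilbert-base-uncased",
                                          max_seq_length=300)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# MarginMSE trains the bi-encoder so that its score margin between a
# positive and a hard-negative passage matches the cross-encoder's margin.
train_loss = losses.MarginMSELoss(model=model)
```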