How to reproduce the MS MARCO dense embedding models?
Hi,
Recently, you have pushed the MRR score of the MS MARCO dense embedding models by a large margin. There is a script to train the model in this repo (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder.py), and I am wondering whether there are other things you do to push the results?
I am asking because I work on a large private dataset (700K Q/A pairs). I am using the scripts from this repo (bi- and cross-encoder); the cross-encoder gives an accuracy score 12 points better than the bi-encoder. In an end-to-end setup, Elasticsearch (no boost, no tricks, just stemming) + cross-encoding gives much better results than dense embeddings + cross-encoding (with brute-force search, so there is no issue with approximate search).
I have tried tricks to increase the batch size of the bi-encoder (24 GB of GPU RAM). For instance, with gradient checkpointing I can go from 48 to 256 examples per batch (by the way, it would be a nice addition to the bi-encoder script; training is around 20-30% slower, but who cares 😃). It provides some improvement (+3 points of classification accuracy), but it does not seem enough to come close to the cross-encoder results. I am quite sure that the new release of DeepSpeed can help increase the batch size further, probably to around 400 examples, by offloading states to host memory, but I am also quite sure that it won't be enough to make the bi-encoder useful in the end-to-end setup. Same with playing with the learning rate, Adam hyperparameters, etc.
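To illustrate what I mean by adding it to the bi-encoder script, here is a minimal sketch (it assumes a transformers version that exposes gradient_checkpointing_enable(), and the model name is only a placeholder):

```python
from sentence_transformers import SentenceTransformer, models

# Minimal sketch: enable gradient checkpointing on the underlying Hugging Face
# model so activations are recomputed during the backward pass, freeing memory
# for larger batches (at the cost of roughly 20-30% slower training).
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=350)  # placeholder model
word_embedding_model.auto_model.gradient_checkpointing_enable()

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```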
Experiments from the last 10 days (with different pos/neg ratios, batch sizes, learning rates, etc.): the group at the top are cross-encoders, the group at the bottom are bi-encoders.
The best bi-encoder (pink) is the one with gradient checkpointing and a very large batch size.
In the RocketQA paper they list some tricks regarding negative example generation: generating negatives with dense embeddings instead of BM25 (they say the same in the ANCE paper) and filtering out negatives that score too high with a cross-encoder. Are you using some of those?
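To be concrete, the second trick is something like the sketch below, reusing the cross-encoder I already trained (the path and the 0.95 threshold are just placeholders I would tune):

```python
from sentence_transformers import CrossEncoder

# Sketch: score mined "negatives" with an already-trained cross-encoder and drop
# the ones it rates almost as high as a true answer (likely false negatives).
# Assumes the cross-encoder outputs scores in [0, 1]; adjust the threshold otherwise.
cross_encoder = CrossEncoder("path/to/my/cross-encoder")  # placeholder path

def filter_hard_negatives(query: str, candidates: list, threshold: float = 0.95) -> list:
    scores = cross_encoder.predict([(query, passage) for passage in candidates])
    return [passage for passage, score in zip(candidates, scores) if score < threshold]
```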
Do you apply some tricks for dimension reduction, like applying a layer norm at the end as in https://arxiv.org/pdf/2012.15156.pdf / https://arxiv.org/pdf/2005.00181.pdf?
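I imagine it as something like the sketch below (just my guess, not necessarily what you do; it assumes models.LayerNorm is available in the installed sentence-transformers version, and 256 is simply my current output size):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# Sketch: project the pooled embedding down to 256 dims, then layer-normalize it,
# as suggested in the papers linked above.
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=350)  # placeholder model
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)
layer_norm = models.LayerNorm(256)  # assumes your sentence-transformers version ships models.LayerNorm

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model, layer_norm])
```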
In my case, negatives that are too good are filtered by rules (does the question appear in the title/text, etc.). Filtering with a cross-encoder changes almost nothing for either the cross-encoder or the bi-encoder models. (Before we used those strict rules, filtering with a cross-encoder model had a large impact on the scores; now it has almost no effect on the accuracy score, so the rules are probably good enough.)
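The rules are nothing fancy; simplified, they boil down to checks like this (the real ones are stricter):

```python
# Simplified sketch of the kind of rule we use: if the question literally appears
# in the candidate's title or text, we do not trust that candidate as a negative.
def is_suspicious_negative(question: str, title: str, text: str) -> bool:
    haystack = f"{title} {text}".lower()
    return question.lower() in haystack
```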
Also, I noticed that you use 1 positive for 4 negative examples for the cross-encoder but not for the bi-encoder; have you noticed a difference in behavior? (In RocketQA they use a 1:4 ratio for the dense embeddings.)
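For the cross-encoder I build the training samples with that ratio roughly like this (a sketch; the function name and the random sampling are only illustrative, not your script):

```python
import random
from sentence_transformers import InputExample

NEGATIVES_PER_POSITIVE = 4  # the 1:4 ratio discussed above

def build_cross_encoder_samples(query: str, positive: str, negative_pool: list) -> list:
    # One positive pair (label 1.0) plus four sampled negative pairs (label 0.0);
    # assumes the pool contains at least NEGATIVES_PER_POSITIVE candidates.
    samples = [InputExample(texts=[query, positive], label=1.0)]
    for negative in random.sample(negative_pool, k=NEGATIVES_PER_POSITIVE):
        samples.append(InputExample(texts=[query, negative], label=0.0))
    return samples
```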
So you get the idea: is there some secret sauce?
FWIW, here is the relevant part of my training script:
```python
...
if model_args.bi_encoder:
    # Bi-encoder: transformer + mean pooling + dense projection down to 256 dims
    word_embedding_model = models.Transformer(
        model_args.model_name_or_path,
        max_seq_length=model_args.seq_len,
        model_args={"gradient_checkpointing": model_args.gradient_checkpointing},
    )
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    dense_model = models.Dense(
        in_features=pooling_model.get_sentence_embedding_dimension(),
        out_features=256,
        activation_function=nn.Tanh(),
    )
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
    train_loss = losses.MultipleNegativesRankingLoss(model=model)
    wandb.watch(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=model_args.num_train_epochs,
        warmup_steps=model_args.warmup_steps,
        evaluation_steps=model_args.eval_steps,
        optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
        output_path=model_args.output_dir,
        use_amp=model_args.fp16,
    )
else:
    # Cross-encoder baseline with a single relevance score output
    model = CrossEncoder(model_args.model_name_or_path, num_labels=1, max_length=model_args.seq_len)
    wandb.watch(model.model)
    model.fit(
        train_dataloader=train_dataloader,
        evaluator=evaluator,
        epochs=model_args.num_train_epochs,
        evaluation_steps=model_args.eval_steps,
        optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
        warmup_steps=model_args.warmup_steps,
        output_path=model_args.output_dir,
        use_amp=model_args.fp16,
    )
```
Thank you! I have added wandb logging to `EmbeddingSimilarityEvaluator`: https://github.com/djstrong/sentence-transformers/tree/wandb if someone is interested.

@pommedeterresautee No, I did not imply training with the `Asym` class. I added this to the docs to make it clearer whether you have a symmetric case (query & doc have the same amount of content) or an asymmetric case (query is short, doc is longer). Currently we are evaluating what the best method is for the asymmetric case, but so far we don't have a conclusion. Once we have a conclusion and know how to train for asymmetric cases (short query, long doc), the docs will be updated with the respective recommendations.