How to reproduce the MS MARCO dense embedding models?
Hi,
Recently, you have pushed the MRR score of the MS MARCO dense embedding models by a large margin. There is a script to train the model in this repo (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder.py), and I am wondering whether there are other things you do to push the results?
I am asking because I work on a large private dataset (700K Q/A pairs). I am using the scripts from this repo (bi- and cross-encoder); the cross-encoder gives an accuracy score 12 points better than the bi-encoder. In an end-to-end setup, Elasticsearch (no boost, no tricks, just stemming) + cross-encoding gives much better results than dense embeddings + cross-encoding (with brute-force search, so there is no issue with approximate search).
I have tried tricks to increase the batch size of the bi-encoder (24 GB of GPU RAM). For instance, with gradient checkpointing I can go from 48 to 256 examples per batch (by the way, it would be a nice addition to the bi-encoder script; training is around 20-30% slower, but who cares 😃). It provides some improvement (+3 points of classification accuracy), but it does not seem enough to come close to the cross-encoder results. I am quite sure that the new release of DeepSpeed can help increase the batch size further, probably to around 400 examples, by offloading states to host memory, but I am also quite sure that it won't be enough to make the bi-encoder useful in the end-to-end setup. Same with playing with the learning rate, Adam hyperparameters, etc.
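To illustrate what I mean by adding it to the bi-encoder script, here is a minimal sketch (it assumes a transformers version that exposes gradient_checkpointing_enable(), and the model name is only a placeholder):

```python
from sentence_transformers import SentenceTransformer, models

# Minimal sketch: enable gradient checkpointing on the underlying Hugging Face
# model so activations are recomputed during the backward pass, freeing memory
# for larger batches (at the cost of roughly 20-30% slower training).
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=350)  # placeholder model
word_embedding_model.auto_model.gradient_checkpointing_enable()

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```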
Experiments from the last 10 days (with different pos/neg ratios, batch sizes, learning rates, etc.): the group at the top are cross-encoders, the group at the bottom are bi-encoders.
The best bi-encoder (pink) is the one with gradient checkpointing and a very large batch size.
In the RocketQA paper they list some tricks regarding negative example generation: generating negatives with dense embeddings instead of BM25 (they say the same in the ANCE paper) and filtering out negatives that score too high with a cross-encoder. Are you using some of those?
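To be concrete, the second trick is something like the sketch below, reusing the cross-encoder I already trained (the path and the 0.95 threshold are just placeholders I would tune):

```python
from sentence_transformers import CrossEncoder

# Sketch: score mined "negatives" with an already-trained cross-encoder and drop
# the ones it rates almost as high as a true answer (likely false negatives).
# Assumes the cross-encoder outputs scores in [0, 1]; adjust the threshold otherwise.
cross_encoder = CrossEncoder("path/to/my/cross-encoder")  # placeholder path

def filter_hard_negatives(query: str, candidates: list, threshold: float = 0.95) -> list:
    scores = cross_encoder.predict([(query, passage) for passage in candidates])
    return [passage for passage, score in zip(candidates, scores) if score < threshold]
```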
Do you apply some tricks for dimension reduction, like applying a layer norm at the end as in https://arxiv.org/pdf/2012.15156.pdf / https://arxiv.org/pdf/2005.00181.pdf?
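I imagine it as something like the sketch below (just my guess, not necessarily what you do; it assumes models.LayerNorm is available in the installed sentence-transformers version, and 256 is simply my current output size):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# Sketch: project the pooled embedding down to 256 dims, then layer-normalize it,
# as suggested in the papers linked above.
word_embedding_model = models.Transformer("distilroberta-base", max_seq_length=350)  # placeholder model
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)
layer_norm = models.LayerNorm(256)  # assumes your sentence-transformers version ships models.LayerNorm

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model, layer_norm])
```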
In my case, negatives that are too good are filtered by rules (does the question appear in the title/text, etc.). Filtering with a cross-encoder changes almost nothing for either the cross-encoder or the bi-encoder models. (Before we used those strict rules, filtering with a cross-encoder model had a large impact on the scores; now it has almost no effect on the accuracy score, so the rules are probably good enough.)
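The rules are nothing fancy; simplified, they boil down to checks like this (the real ones are stricter):

```python
# Simplified sketch of the kind of rule we use: if the question literally appears
# in the candidate's title or text, we do not trust that candidate as a negative.
def is_suspicious_negative(question: str, title: str, text: str) -> bool:
    haystack = f"{title} {text}".lower()
    return question.lower() in haystack
```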
Also, I noticed that you use 1 positive for 4 negative examples for the cross-encoder but not for the bi-encoder; have you noticed a difference in behavior? (In RocketQA they use a 1:4 ratio for the dense embeddings.)
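For the cross-encoder I build the training samples with that ratio roughly like this (a sketch; the function name and the random sampling are only illustrative, not your script):

```python
import random
from sentence_transformers import InputExample

NEGATIVES_PER_POSITIVE = 4  # the 1:4 ratio discussed above

def build_cross_encoder_samples(query: str, positive: str, negative_pool: list) -> list:
    # One positive pair (label 1.0) plus four sampled negative pairs (label 0.0);
    # assumes the pool contains at least NEGATIVES_PER_POSITIVE candidates.
    samples = [InputExample(texts=[query, positive], label=1.0)]
    for negative in random.sample(negative_pool, k=NEGATIVES_PER_POSITIVE):
        samples.append(InputExample(texts=[query, negative], label=0.0))
    return samples
```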
So you get the idea: is there some secret sauce?
FWIW, here is the relevant part of my training script:
```python
...
if model_args.bi_encoder:
    # Bi-encoder: transformer + mean pooling + dense projection down to 256 dims
    word_embedding_model = models.Transformer(
        model_args.model_name_or_path,
        max_seq_length=model_args.seq_len,
        model_args={"gradient_checkpointing": model_args.gradient_checkpointing},
    )
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    dense_model = models.Dense(
        in_features=pooling_model.get_sentence_embedding_dimension(),
        out_features=256,
        activation_function=nn.Tanh(),
    )
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
    train_loss = losses.MultipleNegativesRankingLoss(model=model)
    wandb.watch(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=model_args.num_train_epochs,
        warmup_steps=model_args.warmup_steps,
        evaluation_steps=model_args.eval_steps,
        optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
        output_path=model_args.output_dir,
        use_amp=model_args.fp16,
    )
else:
    # Cross-encoder baseline with a single relevance score output
    model = CrossEncoder(model_args.model_name_or_path, num_labels=1, max_length=model_args.seq_len)
    wandb.watch(model.model)
    model.fit(
        train_dataloader=train_dataloader,
        evaluator=evaluator,
        epochs=model_args.num_train_epochs,
        evaluation_steps=model_args.eval_steps,
        optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
        warmup_steps=model_args.warmup_steps,
        output_path=model_args.output_dir,
        use_amp=model_args.fp16,
    )
```
Thank you! I have added wandb logging to `EmbeddingSimilarityEvaluator`: https://github.com/djstrong/sentence-transformers/tree/wandb if someone is interested.

@pommedeterresautee No, I did not imply training with the `Asym` class. I added this to the docs to make it clearer whether you have a symmetric case (query & doc have the same amount of content) or an asymmetric case (query is short, doc is longer). Currently we are evaluating what the best method is for the asymmetric case, but so far we don't have a conclusion. Once we have a conclusion and know how to train for asymmetric cases (short query, long doc), the docs will be updated with the respective recommendations.