Same model & data, similarity scores changed

Hi there, I’ve just upgraded from version 0.3.2 to 0.3.8. After the update, I re-ran the notebook I was working on, which finds the texts most similar to a set of predefined queries, and it returned different rankings than before. For example, one sentence that had a similarity score of about 0.9 with respect to the query, and was therefore ranked as the most similar text in the corpus, is now ranked second with a score of about 0.6. I am loading the same corpus and using the same text queries and the same model (distiluse-base-multilingual-cased) as before. Any idea what may have caused this change? I realize that several releases occurred in between, so it may be close to impossible to answer, but if you have any clue I’d be curious to know. Thanks in advance!

Top GitHub Comments

nreimers commented, Oct 21, 2020

Hi, I uploaded the old version as distiluse-base-multilingual-cased-v1. That version supports (only) 15 languages.

The version that supports 50+ languages is uploaded as distiluse-base-multilingual-cased-v2.
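One way to check whether a ranking depends on which of the two model versions is loaded is to encode the same query and corpus with each version and compare the resulting order. The following is only a minimal sketch of that comparison: `cosine`, `rank_by_cosine`, and the toy 2-D vectors are hypothetical stand-ins (not part of the library), chosen so the snippet runs without downloading any model. In practice the vectors would come from calling `encode` on each loaded model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, corpus_vecs):
    """Return corpus indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for embeddings of the same query and the same
# two corpus texts, as produced by two different model versions (assumed data).
query_old, corpus_old = [1.0, 0.0], [[0.9, 0.1], [0.5, 0.5]]
query_new, corpus_new = [1.0, 0.0], [[0.5, 0.5], [0.9, 0.1]]

ranking_old = rank_by_cosine(query_old, corpus_old)
ranking_new = rank_by_cosine(query_new, corpus_new)

# If the two rankings differ, the model version is what changed the result.
print("old ranking:", ranking_old)
print("new ranking:", ranking_new)
print("rankings match:", ranking_old == ranking_new)
```

With real models, the lists would be replaced by the output of each model's `encode` call on the same inputs. Pinning an explicit versioned model name in the notebook should keep later upgrades from silently swapping the weights.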

freeIsa commented, Oct 21, 2020

If you are curious and want to look into this, here are a few lines of code demonstrating that the embeddings do change when switching model versions:

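from sentence_transformers import SentenceTransformer

# Old model, loaded from the zip cached locally by the earlier release
MODEL_NAME = '../../../.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip'
model = SentenceTransformer(MODEL_NAME)
old_model_emb = model.encode("Love is love", convert_to_tensor=True)

# Same model name, as resolved by the upgraded release
MODEL_NAME = 'distiluse-base-multilingual-cased'
model = SentenceTransformer(MODEL_NAME)
new_model_emb = model.encode("Love is love", convert_to_tensor=True)

# Print the first five components of each embedding (new first, then old)
print(new_model_emb[:5])
print(old_model_emb[:5])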

resulting in:

tensor([ 0.0848, -0.0112, -0.0094, -0.0278, -0.0173], device='cuda:0')
tensor([ 0.0863, -0.0242, -0.0092, -0.0280,  0.0026], device='cuda:0')

🤔
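To put a number on how far apart the two printed slices are, one could compute their cosine similarity and the per-component differences. The sketch below uses only the five values shown in the output above, copied by hand, so the result describes that five-component slice and not the full embeddings. The variable names are hypothetical.

```python
import math

# First five components as printed above (new model first, then old model).
new_slice = [0.0848, -0.0112, -0.0094, -0.0278, -0.0173]
old_slice = [0.0863, -0.0242, -0.0092, -0.0280, 0.0026]

# Cosine similarity of the two five-component slices.
dot = sum(n * o for n, o in zip(new_slice, old_slice))
norm_new = math.sqrt(sum(n * n for n in new_slice))
norm_old = math.sqrt(sum(o * o for o in old_slice))
slice_similarity = dot / (norm_new * norm_old)

# Absolute difference per component, and the position that moved the most.
abs_diffs = [abs(n - o) for n, o in zip(new_slice, old_slice)]
largest_shift_index = max(range(len(abs_diffs)), key=lambda i: abs_diffs[i])

print(f"cosine similarity of the slices: {slice_similarity:.4f}")
print(f"largest shift at index {largest_shift_index}: {abs_diffs[largest_shift_index]:.4f}")
```

Running the same comparison on the full tensors would give a more meaningful figure, since five components say little about the overall direction of the vectors.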
