Same model & data, similarity scores changed

Hi there, I've just upgraded from version 0.3.2 to 0.3.8. After the update, the notebook I was working on, which finds the texts most similar to predefined queries, returned different rankings than before. For example, one sentence that had a similarity score of about 0.9 with respect to the query, and was therefore ranked as the most similar text in the corpus, is now ranked second with a score of about 0.6. I am loading the same corpus and using the same text queries and the same model (distiluse-base-multilingual-cased) as before. Any idea what may have caused this change? I realize that several releases occurred in between, so it may be close to impossible to answer, but if you have any clue I'd be curious to know. Thanks in advance!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

2 reactions
nreimers commented, Oct 21, 2020

Hi, I uploaded the old version as distiluse-base-multilingual-cased-v1, which supports only 15 languages.

The version that supports 50+ languages is uploaded as distiluse-base-multilingual-cased-v2.
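
Based on the comment above, one way to get the earlier rankings back is probably to request the versioned name explicitly instead of the unversioned one. The sketch below is only an illustration: the published identifiers are spelled distiluse-base-multilingual-cased-v1 and distiluse-base-multilingual-cased-v2, while the constant names and the load_pinned_model helper are hypothetical and not part of sentence-transformers.

```python
# Hypothetical constants: pin the versioned identifiers so that an upgrade
# cannot silently swap the weights behind an unversioned name.
OLD_15_LANGUAGE_MODEL = "distiluse-base-multilingual-cased-v1"
NEW_50_LANGUAGE_MODEL = "distiluse-base-multilingual-cased-v2"


def load_pinned_model(name=OLD_15_LANGUAGE_MODEL):
    """Load a model by its versioned name (hypothetical helper)."""
    # Imported lazily so the constants above can be used without the library.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)
```

Calling load_pinned_model() downloads the model on first use, so it needs network access and the sentence-transformers package installed.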

1 reaction
freeIsa commented, Oct 21, 2020

If you are curious and want to look into this, here are a few lines of code demonstrating that the embeddings do change when switching model versions:

from sentence_transformers import SentenceTransformer

# Old model: the copy cached locally before the upgrade
MODEL_NAME = '../../../.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip'
model = SentenceTransformer(MODEL_NAME)
old_model_emb = model.encode("Love is love", convert_to_tensor=True)

# New model: the same name, resolved by the upgraded library
MODEL_NAME = 'distiluse-base-multilingual-cased'
model = SentenceTransformer(MODEL_NAME)
new_model_emb = model.encode("Love is love", convert_to_tensor=True)

# Compare the first five components of each embedding
print(new_model_emb[:5])
print(old_model_emb[:5])

resulting in:

tensor([ 0.0848, -0.0112, -0.0094, -0.0278, -0.0173], device='cuda:0')
tensor([ 0.0863, -0.0242, -0.0092, -0.0280,  0.0026], device='cuda:0')
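
To put a number on how far the two outputs differ, one option is the cosine similarity between them. The sketch below is a rough illustration only: it uses just the five components printed above, so the result says nothing reliable about the full embeddings, and cosine_similarity is a hypothetical pure-Python helper rather than a library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First five components copied from the printed tensors above.
new_slice = [0.0848, -0.0112, -0.0094, -0.0278, -0.0173]
old_slice = [0.0863, -0.0242, -0.0092, -0.0280, 0.0026]

similarity = cosine_similarity(new_slice, old_slice)
print(f"cosine similarity over the printed slice: {similarity:.4f}")
```

For a real comparison, the same function could be applied to the full vectors, for example via new_model_emb.tolist() and old_model_emb.tolist().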

🤔
