No tokens left for this setting. Consider raising prob_limit=-15
I am following the instructions on YouTube to reproduce the word embeddings.
Here is the code:
from whatlies import Embedding, EmbeddingSet
from whatlies.language import SpacyLanguage
lang_spacy = SpacyLanguage("en_core_web_md")
lang_spacy.score_similar('university')
I'll get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-9705af8b0442> in <module>
----> 1 lang_spacy.score_similar('university')
~\Anaconda3\envs\nlpenv\lib\site-packages\whatlies\language\spacy_lang.py in score_similar(self, emb, n, prob_limit, lower, metric)
222 emb = self[emb]
223
--> 224 queries = self._prepare_queries(prob_limit, lower)
225 distances = self._calculate_distances(emb, queries, metric)
226 by_similarity = sorted(zip(queries, distances), key=lambda z: z[1])
~\Anaconda3\envs\nlpenv\lib\site-packages\whatlies\language\spacy_lang.py in _prepare_queries(self, prob_limit, lower)
134 queries = [w for w in queries if w.is_lower]
135 if len(queries) == 0:
--> 136 raise ValueError(
137 f"No tokens left for this setting. Consider raising prob_limit={prob_limit}"
138 )
ValueError: No tokens left for this setting. Consider raising prob_limit=-15
If I change prob_limit to -15, I get the same error message.
If I modify the last line of code to lang_spacy.score_similar('university', prob_limit=None),
the code works, but I get the following results:
[(Emb[university], 5.960464477539063e-08),
(Emb[where], 0.6037042140960693),
(Emb[who], 0.6461576223373413),
(Emb[there], 0.6536556482315063),
(Emb[he], 0.6619330644607544),
(Emb[she], 0.6653806567192078),
(Emb[must], 0.6659252047538757),
(Emb[should], 0.6668630838394165),
(Emb[how], 0.6717931032180786),
(Emb[what], 0.6724168062210083)]
But this can't be right, can it? I don't want subwords.
Any suggestions on what I am doing wrong?
spaCy version: 2.3.2, whatlies version: 0.4.2
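
For reference, one possible interim workaround (not necessarily the fix the maintainers ended up shipping) is to skip the probability filter entirely and prune the candidates yourself. This is a minimal sketch assuming the whatlies 0.4.2 API visible in the traceback above - score_similar returning (Embedding, distance) pairs and Embedding exposing a name attribute - and the filter predicate is purely illustrative:

from whatlies.language import SpacyLanguage

lang_spacy = SpacyLanguage("en_core_web_md")

# prob_limit=None avoids the "No tokens left" error; ask for a larger
# candidate list and prune it afterwards.
candidates = lang_spacy.score_similar("university", n=100, prob_limit=None)

# Illustrative filter only: keep purely alphabetic tokens and drop a few
# obvious function words. Adjust the predicate as needed.
function_words = {"where", "who", "there", "he", "she", "must", "should", "how", "what"}
filtered = [(emb, dist) for emb, dist in candidates
            if emb.name.isalpha() and emb.name.lower() not in function_words]

print(filtered[:10])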
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 16 (12 by maintainers)
Top GitHub Comments
Yeah they seem to have added a big change.
Should be relatively easy to fix. What I’m more worried about is how to properly test this. We might need to seriously start thinking about model caching.
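
One way such caching could look in the test suite - a sketch only, with hypothetical fixture and test names, interpreting "model caching" as loading the spaCy model once per test session via a session-scoped pytest fixture:

# tests/conftest.py (hypothetical location)
import pytest
from whatlies.language import SpacyLanguage

@pytest.fixture(scope="session")
def spacy_md_language():
    # Load the large spaCy model once per test session instead of once per test.
    return SpacyLanguage("en_core_web_md")

# tests/test_score_similar.py (hypothetical)
def test_score_similar_returns_n_results(spacy_md_language):
    results = spacy_md_language.score_similar("university", n=10, prob_limit=None)
    assert len(results) == 10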
@mkaze this was mainly a question out of curiosity actually 😄. To be clear: I’m really happy with the contributions you’ve made! It’s just that I maintain quite a few projects and it just struck me that your activity certainly is above average.
I share your experience with big open source projects: they can get really bureaucratic once they're big. I recall a colleague whose PySpark PR got merged after five years. It's also my preference to keep things smaller where possible.
I agree that model interpretability is indeed a worthy domain. Funnily enough, it wasn't the initial goal of this library; my initial goal was to write a library that would help me explain word embeddings better. So far, the more I'm able to visualise the embeddings, the more I recognize how much hype there has been on the topic. King - Man + Woman = King (not Queen).
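
To reproduce that observation, here is a small sketch assuming the same en_core_web_md setup as in the issue above, and assuming score_similar also accepts an Embedding rather than only a string (it does in recent whatlies versions); the exact nearest neighbours depend on the model:

from whatlies.language import SpacyLanguage

lang = SpacyLanguage("en_core_web_md")

# Classic analogy arithmetic on whatlies Embedding objects.
analogy = lang["king"] - lang["man"] + lang["woman"]

# Nearest neighbours of the analogy vector; "king" itself usually comes out
# on top, which is the point being made above.
print(lang.score_similar(analogy, n=5, prob_limit=None))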
Some members of our research team also like to explore new embeddings with this library, but it's safe to say there are more users outside of Rasa now than inside.