No tokens left for this setting. Consider raising prob_limit=-15
I am following the instructions on YouTube to reproduce the word embeddings.
Here is the code:
from whatlies import Embedding, EmbeddingSet
from whatlies.language import SpacyLanguage
lang_spacy = SpacyLanguage("en_core_web_md")
lang_spacy.score_similar('university')
I'll get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-9705af8b0442> in <module>
----> 1 lang_spacy.score_similar('university')
~\Anaconda3\envs\nlpenv\lib\site-packages\whatlies\language\spacy_lang.py in score_similar(self, emb, n, prob_limit, lower, metric)
222 emb = self[emb]
223
--> 224 queries = self._prepare_queries(prob_limit, lower)
225 distances = self._calculate_distances(emb, queries, metric)
226 by_similarity = sorted(zip(queries, distances), key=lambda z: z[1])
~\Anaconda3\envs\nlpenv\lib\site-packages\whatlies\language\spacy_lang.py in _prepare_queries(self, prob_limit, lower)
134 queries = [w for w in queries if w.is_lower]
135 if len(queries) == 0:
--> 136 raise ValueError(
137 f"No tokens left for this setting. Consider raising prob_limit={prob_limit}"
138 )
ValueError: No tokens left for this setting. Consider raising prob_limit=-15
If I change prob_limit to -15, I get the same error message.
If I modify the last line of code to lang_spacy.score_similar('university', prob_limit=None),
the code works, but I get the following results:
[(Emb[university], 5.960464477539063e-08),
(Emb[where], 0.6037042140960693),
(Emb[who], 0.6461576223373413),
(Emb[there], 0.6536556482315063),
(Emb[he], 0.6619330644607544),
(Emb[she], 0.6653806567192078),
(Emb[must], 0.6659252047538757),
(Emb[should], 0.6668630838394165),
(Emb[how], 0.6717931032180786),
(Emb[what], 0.6724168062210083)]
But this can't be right, can it? I don't want subwords.
Any suggestions on what I am doing wrong?
spaCy version: 2.3.2, whatlies version: 0.4.2
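
For reference, one possible interim workaround (not necessarily the fix the maintainers ended up shipping) is to skip the probability filter entirely and prune the candidates yourself. This is a minimal sketch assuming the whatlies 0.4.2 API visible in the traceback above - score_similar returning (Embedding, distance) pairs and Embedding exposing a name attribute - and the filter predicate is purely illustrative:

from whatlies.language import SpacyLanguage

lang_spacy = SpacyLanguage("en_core_web_md")

# prob_limit=None avoids the "No tokens left" error; ask for a larger
# candidate list and prune it afterwards.
candidates = lang_spacy.score_similar("university", n=100, prob_limit=None)

# Illustrative filter only: keep purely alphabetic tokens and drop a few
# obvious function words. Adjust the predicate as needed.
function_words = {"where", "who", "there", "he", "she", "must", "should", "how", "what"}
filtered = [(emb, dist) for emb, dist in candidates
            if emb.name.isalpha() and emb.name.lower() not in function_words]

print(filtered[:10])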
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 16 (12 by maintainers)
Top GitHub Comments
Yeah they seem to have added a big change.
Should be relatively easy to fix. What I’m more worried about is how to properly test this. We might need to seriously start thinking about model caching.
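
One way such caching could look in the test suite - a sketch only, with hypothetical fixture and test names, interpreting "model caching" as loading the spaCy model once per test session via a session-scoped pytest fixture:

# tests/conftest.py (hypothetical location)
import pytest
from whatlies.language import SpacyLanguage

@pytest.fixture(scope="session")
def spacy_md_language():
    # Load the large spaCy model once per test session instead of once per test.
    return SpacyLanguage("en_core_web_md")

# tests/test_score_similar.py (hypothetical)
def test_score_similar_returns_n_results(spacy_md_language):
    results = spacy_md_language.score_similar("university", n=10, prob_limit=None)
    assert len(results) == 10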
@mkaze this was mainly a question out of curiosity actually 😄. To be clear: I’m really happy with the contributions you’ve made! It’s just that I maintain quite a few projects and it just struck me that your activity certainly is above average.
I share your experience with big open source projects: they can get really bureaucratic once they're big. I recall a colleague whose PySpark PR got merged after five years. It's also my preference to keep things smaller where possible.
I agree that model interpretability is indeed a worthy domain. Funnily enough, it wasn't the initial goal of this library; my initial goal was to write a library that would help me explain word embeddings better. So far, the more I'm able to visualise the embeddings, the more I recognize how much hype there has been on the topic. King - Man + Woman = King (not Queen).
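
To reproduce that observation, here is a small sketch assuming the same en_core_web_md setup as in the issue above, and assuming score_similar also accepts an Embedding rather than only a string (it does in recent whatlies versions); the exact nearest neighbours depend on the model:

from whatlies.language import SpacyLanguage

lang = SpacyLanguage("en_core_web_md")

# Classic analogy arithmetic on whatlies Embedding objects.
analogy = lang["king"] - lang["man"] + lang["woman"]

# Nearest neighbours of the analogy vector; "king" itself usually comes out
# on top, which is the point being made above.
print(lang.score_similar(analogy, n=5, prob_limit=None))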
Some members of our research team also like to explore new embeddings with this library, but it's safe to say there are more users outside of Rasa now than inside.