Word Vectors are case insensitive
Is it expected that en_core_web_lg uses the lowercased GloVe vectors?
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector("Apple"), nlp.vocab.get_vector("apple"))
True
The StringStore has different hashes for each:
nlp.vocab.strings['apple']
8566208034543834098
nlp.vocab.strings['Apple']
6418411030699964375
But even when these hashes are used to look up the vectors with nlp.vocab.get_vector, the comparison is still True:
nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector(6418411030699964375), nlp.vocab.get_vector(8566208034543834098))
True
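To see how widespread the behaviour is, the same comparison can be run over a list of words. The following is only a minimal sketch: find_case_collisions and the toy_vectors table are hypothetical names with made-up values, and get_vector stands for any callable that maps a string to a vector (nlp.vocab.get_vector from a loaded pipeline should fit).

```python
def find_case_collisions(get_vector, words):
    """Return the words whose vector is identical to that of their lowercased form.

    `get_vector` is any callable mapping a string to a sequence of floats,
    for example `nlp.vocab.get_vector` from a loaded spaCy pipeline.
    """
    collisions = []
    for word in words:
        lowered = word.lower()
        if lowered == word:
            continue  # already lowercase, nothing to compare
        # Compare element by element; works for lists and NumPy arrays alike.
        if list(get_vector(word)) == list(get_vector(lowered)):
            collisions.append(word)
    return collisions


# Stand-in lookup table (made-up values) that mimics the behaviour in the report.
toy_vectors = {
    "apple": [0.1, 0.2],
    "Apple": [0.1, 0.2],  # same values as "apple"
    "paris": [0.3, 0.4],
    "Paris": [0.9, 0.8],  # differs from "paris"
}

collisions = find_case_collisions(toy_vectors.__getitem__, ["Apple", "Paris", "apple"])
print(collisions)  # ['Apple']
```

With a real pipeline, words that have no vector at all may come back as all-zero vectors for both spellings, so they would also show up as collisions; filtering with nlp.vocab.has_vector first avoids that.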
Info about spaCy
- spaCy version: 2.0.8
- Platform: Linux-4.4.0-1054-aws-x86_64-with-Ubuntu-16.04-xenial
- Python version: 3.6.3
- Models: en_core_web_lg, en_vectors_web_lg
Issue Analytics
- State:
- Created 5 years ago
- Comments:6 (5 by maintainers)
We’ve been working on improved vectors off-and-on in the background for a while (in particular with floret), and finally the English models for v3.4.0 have new English vectors that are case-sensitive.
If you’re curious about the details, the project for training the vectors is available here:
https://github.com/explosion/spacy-vectors-builder/tree/lang/en
The project can train both fastText and floret vectors, but we want to keep using standard/default vectors for the v3 English models, so the new vectors are the fastText vectors.
Hi @BatMrE, the problem is in the vectors themselves, not in the spaCy library code, so it’s not something that can be solved with a PR. We need to update the vector data that’s loaded into the pretrained English pipelines.
For most languages other than English we trained new fastText vectors for v2.3+, and we are actively working on updating the data sources and training for vectors across the board, although I can’t make any promises about how soon they’ll be included in the pretrained pipelines.
If you have your own vectors, you can always load them into a new pipeline with spacy init vectors. Since the statistical components depend on the vectors, you can’t replace the vectors in en_core_web_lg with a different set without breaking everything, but you can always train a new pipeline from scratch with alternate (case-sensitive) vectors.
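As a rough illustration of what case-sensitive vectors look like from the Python side, the sketch below adds two toy vectors to a blank English pipeline with Vocab.set_vector. The three-dimensional values are made up for the example, and this is not how the pretrained pipelines are built; for a real vectors file, the spacy init vectors command mentioned above is the route to take.

```python
import numpy as np
import spacy

# Blank English pipeline: no pretrained components and no vectors yet.
nlp = spacy.blank("en")

# Toy 3-dimensional vectors (made-up values), one per casing.
nlp.vocab.set_vector("Apple", np.array([1.0, 0.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("apple", np.array([0.0, 1.0, 0.0], dtype="float32"))

apple_upper = nlp.vocab.get_vector("Apple")
apple_lower = nlp.vocab.get_vector("apple")

# Each casing now has its own vector, so the comparison from the report is False.
print(np.array_equal(apple_upper, apple_lower))
```

A pipeline assembled this way has only the vectors, so any tagger, parser or NER component still has to be trained on top of it.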