Word Vectors are case insensitive

Is it expected that en_core_web_lg uses lowercased GloVe vectors? The vectors for "Apple" and "apple" are identical:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector("Apple"), nlp.vocab.get_vector("apple"))

True

The StringStore has different hashes for each:

nlp.vocab.strings['apple'] 
8566208034543834098
nlp.vocab.strings['Apple']
6418411030699964375

But even when these hashes are used to look up the vectors with nlp.vocab.get_vector, the two vectors are still equal:

nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector(6418411030699964375),nlp.vocab.get_vector(8566208034543834098))
True
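
The same check can be run over several words at once. The following is a minimal sketch, not part of the original issue: the helper name `case_differences` and the toy lookup table are hypothetical, and the function only assumes a callable that maps a string to a vector, such as `nlp.vocab.get_vector`.

```python
def case_differences(get_vector, words):
    """Report, for each word that is not already lowercase, whether its
    vector differs from the vector of its lowercased form.

    get_vector is any callable mapping a string to a sequence of floats.
    With spaCy this could be nlp.vocab.get_vector (assumption: the model
    is loaded by the caller).
    """
    report = {}
    for word in words:
        lower = word.lower()
        if lower == word:
            continue  # already lowercase, nothing to compare against
        # list() makes the comparison work for plain lists and NumPy arrays
        report[word] = list(get_vector(word)) != list(get_vector(lower))
    return report


# Toy table standing in for a real vector store (hypothetical values).
toy_vectors = {
    "apple": [0.1, 0.2],
    "Apple": [0.1, 0.2],  # same as "apple": case-insensitive entry
    "us": [0.3, 0.4],
    "US": [0.9, 0.8],     # differs from "us": case-sensitive entry
}

report = case_differences(toy_vectors.__getitem__, ["Apple", "US", "apple"])
print(report)  # {'Apple': False, 'US': True}
```

With a loaded pipeline, passing `nlp.vocab.get_vector` in place of the toy lookup should reproduce the comparison from the issue; a value of False for every word suggests the vector table does not distinguish case.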

Info about spaCy

  • spaCy version: 2.0.8
  • Platform: Linux-4.4.0-1054-aws-x86_64-with-Ubuntu-16.04-xenial
  • Python version: 3.6.3
  • Models: en_core_web_lg, en_vectors_web_lg

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

10 reactions
adrianeboyd commented, Jul 12, 2022

We’ve been working on improved vectors off-and-on in the background for a while (in particular with floret), and finally the English models for v3.4.0 have new English vectors that are case-sensitive.

If you’re curious about the details, the project for training the vectors is available here:

https://github.com/explosion/spacy-vectors-builder/tree/lang/en

The project can train both fastText and floret vectors, but we want to keep using standard/default vectors for the v3 English models, so the new vectors are the fastText vectors.

1 reaction
adrianeboyd commented, Aug 26, 2021

Hi @BatMrE, the problem is in the vectors themselves, not in the code of the spaCy library, so it’s not something that can be solved with a PR. We need to update the vectors data that’s loaded into the pretrained English pipelines.

For most languages other than English, we trained new fastText vectors for v2.3+. We are actively working on updating the data sources and training for vectors across the board, although I can’t make any promises about how soon they’ll be included in the pretrained pipelines.

If you have your own vectors, you can always load them into a new pipeline with spacy init vectors. Since the statistical components depend on the vectors, you can’t replace the vectors in en_core_web_lg with a different set without breaking everything, but you can always train a new pipeline from scratch with alternate (case-sensitive) vectors.
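
As a rough illustration of the "own vectors" route, the sketch below writes a tiny case-sensitive table in the word2vec text format (a header line with row count and dimension, then one space-separated row per word), which is the kind of file `spacy init vectors` reads. The helper name `write_word2vec_text`, the file name, and the vector values are all hypothetical; real vectors would come from your own training run.

```python
import tempfile
from pathlib import Path


def write_word2vec_text(path, table):
    """Write {word: [floats]} to a word2vec-style text file.

    First line: "<number of rows> <dimension>", then one row per word.
    """
    dims = {len(vec) for vec in table.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must share one dimension")
    dim = dims.pop()
    lines = [f"{len(table)} {dim}"]
    for word, vec in table.items():
        lines.append(" ".join([word, *(f"{x:.6f}" for x in vec)]))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical toy data: "Apple" and "apple" get distinct vectors.
toy_table = {
    "Apple": [0.9, 0.1, 0.0],
    "apple": [0.1, 0.8, 0.3],
}
out_path = Path(tempfile.mkdtemp()) / "toy_vectors.txt"
write_word2vec_text(out_path, toy_table)

# The file could then be converted for use in a new pipeline, e.g.:
#   python -m spacy init vectors en toy_vectors.txt ./toy_vectors_model
# (output directory name is an assumption; check the CLI help for options)
```

The resulting directory holds only the vectors; as the comment above notes, the statistical components still have to be trained against them from scratch.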
