Word Vectors are case insensitive


Is it expected that en_core_web_lg is using the lowercased GloVe vectors?

import numpy as np
import spacy
nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector("Apple"), nlp.vocab.get_vector("apple"))

True

The StringStore has different hashes for each:

nlp.vocab.strings['apple'] 
8566208034543834098

nlp.vocab.strings['Apple']
6418411030699964375

But even when these hashes are used to look up the vectors with nlp.vocab.get_vector, the result is True:

nlp = spacy.load("en_core_web_lg")
np.array_equal(nlp.vocab.get_vector(6418411030699964375), nlp.vocab.get_vector(8566208034543834098))
True
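
To run the same check over more than one word, the comparison above can be wrapped in a small helper. This is only a sketch: `case_collisions` and `toy_table` are hypothetical names that are not part of spaCy, and the toy table stands in for a real vector lookup so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List, Sequence


def case_collisions(
    get_vector: Callable[[str], Sequence[float]],
    words: Iterable[str],
) -> List[str]:
    """Return the words whose capitalized and lowercase forms share a vector."""
    shared = []
    for word in words:
        # Convert to plain floats so lists and array-like objects compare the same way.
        lower = [float(x) for x in get_vector(word.lower())]
        upper = [float(x) for x in get_vector(word.capitalize())]
        if lower == upper:
            shared.append(word)
    return shared


# Hypothetical stand-in for a vector table, used only to exercise the helper.
toy_table = {
    "apple": [1.0, 2.0],
    "Apple": [1.0, 2.0],   # same values as "apple": a collision
    "turkey": [3.0, 4.0],
    "Turkey": [5.0, 6.0],  # different values: case-sensitive
}

print(case_collisions(toy_table.__getitem__, ["apple", "turkey"]))  # ['apple']
```

With a loaded pipeline, nlp.vocab.get_vector could be passed in place of the toy lookup. Words without an entry in the vector table may come back as all-zero vectors and would then be reported as collisions, so it is probably worth filtering with nlp.vocab.has_vector first.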

Info about spaCy

spaCy version: 2.0.8
Platform: Linux-4.4.0-1054-aws-x86_64-with-Ubuntu-16.04-xenial
Python version: 3.6.3
Models: en_core_web_lg, en_vectors_web_lg

Issue Analytics

Created 5 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

10 reactions

adrianeboyd commented, Jul 12, 2022

We’ve been working on improved vectors off-and-on in the background for a while (in particular with floret), and finally the English models for v3.4.0 have new English vectors that are case-sensitive.

If you’re curious about the details, the project for training the vectors is available here:

https://github.com/explosion/spacy-vectors-builder/tree/lang/en

The project can train both fastText and floret vectors, but we want to keep using standard/default vectors for the v3 English models, so the new vectors are the fastText vectors.

1 reaction

adrianeboyd commented, Aug 26, 2021

Hi @BatMrE, the problem is in the vectors themselves, not code in the spacy library, so it’s not something that can be solved with a PR. We need to update the vectors data that’s loaded into the pretrained English pipelines.

For most languages other than English we trained new fastText vectors for v2.3+, and we are actively working on updating the data sources and training for vectors across the board, although I can’t make any promises about how soon they’ll be included in the pretrained pipelines.

If you have your own vectors, you can always load them into a new pipeline with spacy init vectors. Since the statistical components depend on the vectors, you can’t replace the vectors in en_core_web_lg with a different set without breaking everything, but you can always train a new pipeline from scratch with alternate (case-sensitive) vectors.
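
As a rough illustration of the first step, the sketch below writes a tiny case-sensitive vector table in the plain-text word2vec layout (a header line with the row count and dimensions, then one word per line followed by its values). The helper `write_word2vec_text`, the toy values, and the file name are all hypothetical and chosen only for the example; real vectors would come from your own training run.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_word2vec_text(vectors: Dict[str, List[float]], path: Path) -> None:
    """Write vectors as a plain-text word2vec table.

    The first line holds "<row count> <dimensions>"; each following line
    holds a word and its space-separated values. Words must not contain
    whitespace, since whitespace separates the columns.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    widths = {len(values) for values in vectors.values()}
    if len(widths) != 1:
        raise ValueError("all vectors must have the same number of dimensions")
    width = widths.pop()
    lines = [f"{len(vectors)} {width}"]
    for word, values in vectors.items():
        lines.append(" ".join([word] + [repr(float(v)) for v in values]))
    path.write_text("\n".join(lines) + "\n", encoding="utf8")


# Hypothetical toy data: "Apple" and "apple" get separate rows,
# so the casing distinction survives in the file.
toy_vectors = {
    "Apple": [0.1, 0.2, 0.3],
    "apple": [0.4, 0.5, 0.6],
}
output_path = Path(tempfile.mkdtemp()) / "toy_vectors.txt"
write_word2vec_text(toy_vectors, output_path)
print(output_path.read_text(encoding="utf8"))
```

Assuming spaCy v3, a file like this could then be handed to the command mentioned above, for example python -m spacy init vectors en toy_vectors.txt ./toy_vectors_model, where the output directory name is a placeholder. Training a pipeline on top of the resulting vectors is a separate step.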