Calling `doc.vector` with spaCy models that have no vectors produces zero-length vectors for empty strings
I noticed this in our dev environment with dummy data, which used the `en_core_web_sm` model rather than our custom model that has a set of vectors.
import spacy
nlp = spacy.load("en_core_web_sm")  # small model, ships without word vectors
len(nlp("hello").vector)  # 96
len(nlp("").vector)  # 0
Whereas doing the same with the `en_core_web_lg` model produces a doc vector of length 300 for both examples.
These inconsistent lengths caused us some grief with TensorFlow, which took a little while to track down. I think this might happen whenever the model's vector length is zero. If I drop the vectors from the `en_core_web_lg` model like below, the new model exhibits the same behaviour:
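import spacy
from spacy.lang.en import English
# Load the large model, then swap in a fresh vocab that has no vectors.
nlp = spacy.load("en_core_web_lg")
nlp.vocab = English.Defaults.create_vocab()
# Save and reload the modified pipeline.
nlp.to_disk("lg_no_vectors")
nlp = spacy.load("lg_no_vectors")
len(nlp("").vector)  # 0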
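A mismatch like this can be caught before the vectors reach TensorFlow by checking their lengths up front. The following is a minimal sketch, not anything spaCy provides: `find_odd_lengths` is a hypothetical helper name, and it assumes the most common length in the batch is the intended one.

```python
from collections import Counter


def find_odd_lengths(vectors):
    """Return (expected_length, indices of vectors with a different length).

    The expected length is taken to be the most common one in the batch,
    which is an assumption of this sketch. Returns (None, []) for an
    empty batch.
    """
    lengths = [len(v) for v in vectors]
    if not lengths:
        return None, []
    expected, _count = Counter(lengths).most_common(1)[0]
    odd = [i for i, n in enumerate(lengths) if n != expected]
    return expected, odd
```

Passing it something like `[nlp(text).vector for text in texts]` would point at the positions of the empty documents.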
Info about spaCy
- spaCy version: 2.2.4
- Platform: Linux-5.3.0-1017-aws-x86_64-with-glibc2.27
- Python version: 3.8.2
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
I see why that could cause some grief.
The vector property uses the tensor property as a backup if there aren’t any vectors, but empty documents don’t get a tensor (no real processing is done for empty docs), and the empty doc itself doesn’t really have access to what the tensor width would have been for a non-empty doc, either.
It also turns out (I’m surprised I hadn’t noticed before!) that only the `tagger` sets the tensor, not the `parser` or `ner`, so custom pipelines may not necessarily end up with tensors, either. (I’m not even sure whether this is the intended behavior.)
The v2 config is a bit buried (this will all be in an explicit config in v3), but you can get the tensor width from something like `nlp.get_pipe("tagger").model.tok2vec.nO`, or also just from inspecting a non-empty document.
Because it’s using the shape of the vectors data array underneath when it returns the tensor-as-vector property, I can suggest a hack to get it to return the same length for empty docs:
This is clearly hacky, but I’m not sure there’s a good general-purpose solution here. (Tongue-in-cheek: don’t process empty documents? don’t rely on the vector property for models without vectors?)
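For anyone who would rather handle this on the calling side, here is a minimal sketch of a wrapper around `doc.vector`. It is not the hack referred to above and not part of spaCy: `infer_vector_width` and `fixed_width_vector` are hypothetical helper names, the width is taken from a non-empty probe document as suggested earlier, and the vectors are assumed to be on CPU so that NumPy can read them.

```python
import numpy as np


def infer_vector_width(nlp, probe_text="hello"):
    """Return the length of doc.vector for a non-empty probe document."""
    return len(nlp(probe_text).vector)


def fixed_width_vector(doc, width):
    """Return doc.vector, or a zero vector of `width` if doc.vector is empty.

    Assumption of this sketch: doc.vector is a 1-D CPU array.
    """
    vector = np.asarray(doc.vector, dtype="float32")
    if vector.shape[0] == 0:
        return np.zeros((width,), dtype="float32")
    return vector
```

With `width = infer_vector_width(nlp)` computed once, something like `np.stack([fixed_width_vector(nlp(t), width) for t in texts])` should give a rectangular batch even when some of the texts are empty.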
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.