Calling `doc.vector` on spaCy models with no vectors produces zero-length vectors for empty strings

I noticed this in our dev environment, where dummy data was processed with en_core_web_sm rather than our custom model, which has a set of vectors.

import spacy

nlp = spacy.load("en_core_web_sm")

len(nlp("hello").vector)  # 96
len(nlp("").vector)  # 0

Whereas doing the same for the en_core_web_lg model produces a doc vector of length 300 for both examples.

These inconsistent lengths caused us some grief with TensorFlow, which took a little while to track down. I think this might happen whenever the model has no vectors. If I drop the vectors from the en_core_web_lg model like below, the new model exhibits the same behaviour:

import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_lg")
nlp.vocab = English.Defaults.create_vocab()
nlp.to_disk("lg_no_vectors")
nlp = spacy.load("lg_no_vectors")
len(nlp("").vector)  # 0
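
To illustrate why mixed lengths hurt downstream, here is a minimal sketch that uses plain NumPy arrays as stand-ins for the `doc.vector` outputs (the widths 96 and 0 come from the example above). `find_odd_lengths` is a hypothetical helper name, not part of spaCy, and the stacking error shown is NumPy's, used here only as an approximation of what a tensor library would complain about.

```python
import numpy as np


def find_odd_lengths(vectors, expected_width):
    """Return the indices of vectors whose length differs from expected_width."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_width]


# Stand-ins for nlp("hello").vector and nlp("").vector from en_core_web_sm.
vectors = [np.zeros(96, dtype="f"), np.zeros(0, dtype="f")]

# Locate the offending rows before they reach the model.
bad_indices = find_odd_lengths(vectors, expected_width=96)

# Stacking rows of different lengths into one batch raises ValueError.
try:
    np.stack(vectors)
    stack_failed = False
except ValueError:
    stack_failed = True
```

Running a check like this on a batch before handing it to the model points directly at the empty strings instead of surfacing as a shape error deep inside the training code.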

Info about spaCy

spaCy version: 2.2.4
Platform: Linux-5.3.0-1017-aws-x86_64-with-glibc2.27
Python version: 3.8.2

Top GitHub Comments

adrianeboyd commented, May 21, 2020

I see why that could cause some grief.

The vector property uses the tensor property as a backup if there aren’t any vectors, but empty documents don’t get a tensor (no real processing is done for empty docs), and the empty doc itself doesn’t really have access to what the tensor width would have been for a non-empty doc, either.
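
The fallback order described in this comment can be modeled roughly as follows. This is a simplified sketch based on my reading of the comment, not spaCy's actual source: `doc_vector_sketch` and its arguments are hypothetical, NumPy arrays stand in for the vector table and the doc tensor, and the word-vector branch is only a placeholder.

```python
import numpy as np


def doc_vector_sketch(n_tokens, table, tensor):
    """Toy model of the fallback; `table` is the (n_keys, width) vector table."""
    table_width = table.shape[1]
    if n_tokens == 0:
        # Empty doc: there is no tensor, so only the table's width is known.
        return np.zeros((table_width,), dtype="f")
    if table.size > 0:
        # Placeholder for averaging the tokens' word vectors.
        return table[:n_tokens].mean(axis=0)
    if tensor.size > 0:
        # No word vectors: fall back to the mean of the doc tensor.
        return tensor.mean(axis=0)
    return np.zeros((table_width,), dtype="f")


no_vectors = np.zeros((0, 0), dtype="f")  # empty table, as in en_core_web_sm
sm_tensor = np.ones((1, 96), dtype="f")   # tensor for a one-token doc
no_tensor = np.zeros((0, 0), dtype="f")   # empty docs get no tensor

len_hello = len(doc_vector_sketch(1, no_vectors, sm_tensor))
len_empty = len(doc_vector_sketch(0, no_vectors, no_tensor))

# An empty table that still declares a width changes the empty-doc result.
widened = np.zeros((0, 96), dtype="f")
len_empty_widened = len(doc_vector_sketch(0, widened, no_tensor))
```

In this toy model the non-empty doc gets its width from the tensor, while the empty doc can only report the width of the vector table, which is 0 when the model ships no vectors.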

It also turns out (I’m surprised I hadn’t noticed before!) that only the tagger sets the tensor, not the parser or ner, so custom pipelines may not necessarily end up with tensors, either. (I’m not even sure whether this is the intended behavior.)

The v2 config is a bit buried (this will all be in an explicit config in v3), but you can get the tensor width from something like `nlp.get_pipe("tagger").model.tok2vec.nO`, or also just from inspecting a non-empty document.

Because it’s using the shape of the vectors data array underneath when it returns the tensor-as-vector property, I can suggest a hack to get it to return the same length for empty docs:

import spacy
from thinc.neural.util import get_array_module

nlp = spacy.load("en_core_web_sm")

tensor_width = nlp.get_pipe("tagger").model.tok2vec.nO # or tensor_width = len(nlp("a").vector)
assert nlp.vocab.vectors.data.shape == (0, 0) # don't do this to a non-empty vector table!

xp = get_array_module(nlp.vocab.vectors.data)
nlp.vocab.vectors.data = xp.zeros((0, tensor_width), dtype="f")

assert len(nlp("").vector) == 96
assert len(nlp("a").vector) == 96

This is clearly hacky, but I’m not sure there’s a good general-purpose solution here. (Tongue-in-cheek: don’t process empty documents? don’t rely on the vector property for models without vectors?)
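
A less invasive alternative, offered only as a sketch, is to leave the model untouched and normalize the result at the call site. `fixed_width_vector` and `FakeDoc` below are hypothetical names, not spaCy API; the helper assumes the vector lives on the CPU and can be converted with `np.asarray`, and `FakeDoc` merely stands in for a real `Doc` so the example runs without a model installed.

```python
import numpy as np


def fixed_width_vector(doc, width):
    """Return doc.vector, substituting zeros of `width` when it is empty."""
    vec = np.asarray(doc.vector, dtype="f")
    if vec.shape == (width,):
        return vec
    if vec.size == 0:
        # Empty doc on a model without vectors: pad to the expected width.
        return np.zeros((width,), dtype="f")
    # Any other mismatch is unexpected, so fail loudly instead of guessing.
    raise ValueError(f"expected width {width}, got shape {vec.shape}")


class FakeDoc:
    """Hypothetical stand-in exposing only the `vector` attribute."""

    def __init__(self, vector):
        self.vector = vector


empty_doc = FakeDoc(np.zeros(0, dtype="f"))
hello_doc = FakeDoc(np.ones(96, dtype="f"))

empty_vec = fixed_width_vector(empty_doc, 96)
hello_vec = fixed_width_vector(hello_doc, 96)
```

With a real pipeline, `width` could be taken from a non-empty document, for example `len(nlp("a").vector)` as suggested above, and the helper applied to every doc before batching.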
