Calling `doc.vector` on spaCy models with no vectors produces zero-length vectors for empty strings


I noticed this in our dev environment, where dummy data was run through en_core_web_sm rather than our custom model, which has a set of vectors.

import spacy

nlp = spacy.load("en_core_web_sm")
len(nlp("hello").vector)
>>> 96

len(nlp("").vector)
>>> 0

Whereas doing the same for the en_core_web_lg model produces a doc vector of length 300 for both examples.

These inconsistent lengths caused us some grief with TensorFlow, and the cause took a little while to track down. I think this might happen whenever the model's vector length is zero. If I drop the vectors from the en_core_web_lg model as below, the new model exhibits the same behaviour:

import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_lg")
nlp.vocab = English.Defaults.create_vocab()
nlp.to_disk("lg_no_vectors")
nlp = spacy.load("lg_no_vectors")
len(nlp("").vector)
>>> 0

Info about spaCy

  • spaCy version: 2.2.4
  • Platform: Linux-5.3.0-1017-aws-x86_64-with-glibc2.27
  • Python version: 3.8.2

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
adrianeboyd commented, May 21, 2020

I see why that could cause some grief.

The vector property uses the tensor property as a backup if there aren’t any vectors, but empty documents don’t get a tensor (no real processing is done for empty docs), and the empty doc itself doesn’t really have access to what the tensor width would have been for a non-empty doc, either.
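To make the fallback easier to follow, here is a toy model of it in plain NumPy. It is a sketch based only on the description in this comment, plus the remark further down that the width comes from the shape of the vectors data array; it is not spaCy's source code. The function name and both parameters are hypothetical stand-ins: `token_vectors` plays the role of the rows looked up in the vectors table, and `tensor` plays the role of `doc.tensor`.

```python
import numpy as np


def toy_doc_vector(token_vectors, tensor):
    """Toy model of the doc.vector fallback (hypothetical, not spaCy code).

    token_vectors: array of shape (n_tokens, vector_width); vector_width is 0
                   when the model ships no word vectors.
    tensor:        array of shape (n_tokens, tensor_width), or an empty array
                   when no pipeline component set a tensor.
    """
    n_tokens, vector_width = token_vectors.shape
    if n_tokens == 0:
        # Empty doc: nothing was processed, so the only width available
        # is the width of the vectors table.
        return np.zeros((vector_width,), dtype="f")
    if vector_width > 0:
        # The model has word vectors: average them.
        return token_vectors.mean(axis=0)
    if tensor.size > 0:
        # No word vectors: fall back to the mean of the per-token tensor rows.
        return tensor.mean(axis=0)
    # Neither vectors nor a tensor: zeros of the vectors-table width.
    return np.zeros((vector_width,), dtype="f")


# A one-token doc in a model without vectors but with a 96-wide tensor.
non_empty = toy_doc_vector(np.zeros((1, 0), dtype="f"), np.ones((1, 96), dtype="f"))
# An empty doc in the same model: no tensor, and a vectors table of width 0.
empty = toy_doc_vector(np.zeros((0, 0), dtype="f"), np.zeros((0,), dtype="f"))
print(len(non_empty), len(empty))  # 96 0
```

Under these assumptions, the two calls reproduce the 96 versus 0 mismatch from the report: the non-empty doc takes its width from the tensor, while the empty doc can only take it from the vectors table.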

It also turns out (I’m surprised I hadn’t noticed before!) that only the tagger sets the tensor, not the parser or ner, so custom pipelines may not necessarily end up with tensors, either. (I’m not even sure whether this is the intended behavior.)

The v2 config is a bit buried (this will all be in an explicit config in v3), but you can get the tensor width from something like nlp.get_pipe("tagger").model.tok2vec.nO, or also just from inspecting a non-empty document.

Because it’s using the shape of the vectors data array underneath when it returns the tensor-as-vector property, I can suggest a hack to get it to return the same length for empty docs:

import spacy
from thinc.neural.util import get_array_module

nlp = spacy.load("en_core_web_sm")

tensor_width = nlp.get_pipe("tagger").model.tok2vec.nO # or tensor_width = len(nlp("a").vector)
assert nlp.vocab.vectors.data.shape == (0, 0) # don't do this to a non-empty vector table!

xp = get_array_module(nlp.vocab.vectors.data)
nlp.vocab.vectors.data = xp.zeros((0, tensor_width), dtype="f")

assert len(nlp("").vector) == 96
assert len(nlp("a").vector) == 96

This is clearly hacky, but I’m not sure there’s a good general-purpose solution here. (Tongue-in-cheek: don’t process empty documents? don’t rely on the vector property for models without vectors?)
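If patching the vector table feels too invasive, the tongue-in-cheek advice can be taken half-seriously with a small guard in the calling code. The sketch below is one possible approach, not something from the thread: `safe_doc_vector` is a hypothetical helper, `width` is a value you supply yourself (for example the tensor width obtained as described above), and it assumes the vector lives on the CPU so that NumPy can read it. It only needs an object with a `vector` attribute.

```python
import numpy as np


def safe_doc_vector(doc, width):
    """Return doc.vector, substituting zeros of `width` for an empty vector.

    Hypothetical helper: `doc` is anything with a `.vector` attribute and
    `width` is the length the downstream consumer expects.
    """
    vec = np.asarray(doc.vector, dtype="float32")
    if vec.shape[0] == width:
        return vec
    if vec.shape[0] == 0:
        # Empty doc from a model without vectors: pad to the expected width.
        return np.zeros((width,), dtype="float32")
    # Any other length points to a different problem, so fail loudly.
    raise ValueError(f"expected a vector of length {width}, got {vec.shape[0]}")


class FakeDoc:
    """Stand-in for a spaCy Doc, used only to exercise the helper."""

    def __init__(self, vector):
        self.vector = vector


padded = safe_doc_vector(FakeDoc(np.zeros((0,), dtype="float32")), 96)
unchanged = safe_doc_vector(FakeDoc(np.ones((96,), dtype="float32")), 96)
print(padded.shape, unchanged.shape)  # (96,) (96,)
```

With a real pipeline the call would look like `safe_doc_vector(nlp(text), tensor_width)`, which keeps every row handed to TensorFlow the same length without touching `nlp.vocab`.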

0 reactions
github-actions[bot] commented, Nov 4, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


