
TypeError deserializing a DocBin when user_data is None

See original GitHub issue

I’ve been trying out 3.0.0rc1, enjoying the new config files, and found some possibly unexpected behavior. If the user_data attribute of a Doc instance is None and you serialize it to a DocBin with store_user_data=True, then using that DocBin for training with spacy train raises a TypeError.

[...]
  File "...\lib\site-packages\spacy\training\corpus.py", line 153, in make_examples
    for reference in reference_docs:
  File "...\lib\site-packages\spacy\training\corpus.py", line 188, in read_docbin
    for doc in docs:
  File "...\lib\site-packages\spacy\tokens\_serialize.py", line 135, in get_docs
    doc.user_data.update(user_data)
TypeError: 'NoneType' object is not iterable

OK, figuring out what went wrong here is straightforward: doc.user_data really shouldn’t be None. Deserializing the DocBin might be just one place where this would be a problem. Should there be a type check anyway?
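For illustration only (a sketch of the kind of guard I mean, not spaCy’s actual code), the failing call and a possible defensive check look like this:

# Minimal sketch: why the update fails, and what a guard could look like.
stored_user_data = None   # what deserialization hands back in this bug
doc_user_data = {}        # the fresh Doc's user_data dict

try:
    doc_user_data.update(stored_user_data)  # dict.update(None) raises
except TypeError as err:
    print(err)  # 'NoneType' object is not iterable

# Hypothetical guard: only merge when there is something to merge.
if stored_user_data:
    doc_user_data.update(stored_user_data)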

Reproducing

Set the user_data attribute of a Doc to None before serializing to a DocBin.

import spacy
from spacy.tokens import DocBin

texts = [
    "Class for managing annotated corpora for training and evaluation data.",
    "Storage for entities and aliases of a knowledge base for entity linking.",
    "Container for convenient access to large lookup tables and dictionaries.",
    "Store morphological analyses and map them to and from hash values.",
]

nlp = spacy.load("en_core_web_trf")

docs = []
for i, text in enumerate(texts):
    doc = nlp.make_doc(text)
    doc.cats = {"label": i % 2}
    doc.user_data = None  # oops
    docs.append(doc)

with open("train.spacy", "wb") as f:
    f.write(DocBin(docs=docs[:2], store_user_data=True).to_bytes())
with open("dev.spacy", "wb") as f:
    f.write(DocBin(docs=docs[2:], store_user_data=True).to_bytes())

Run spacy train (test.cfg included below).

python -m spacy train test.cfg
Full traceback:
ℹ Using CPU
=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['textcat']
Created vocabulary
Finished initializing nlp object
Traceback (most recent call last):
  File "...\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "...\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "...\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "...\lib\site-packages\spacy\cli\_util.py", line 65, in setup_cli
    command(prog_name=COMMAND)
  File "...\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "...\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "...\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "...\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "...\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "...\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "...\lib\site-packages\spacy\cli\train.py", line 56, in train_cli
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "...\lib\site-packages\spacy\training\initialize.py", line 51, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "...\lib\site-packages\spacy\language.py", line 1232, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "...\lib\site-packages\spacy\pipeline\textcat.py", line 326, in initialize
    validate_get_examples(get_examples, "TextCategorizer.initialize")
  File "spacy\training\example.pyx", line 62, in spacy.training.example.validate_get_examples
  File "spacy\training\example.pyx", line 41, in spacy.training.example.validate_examples
  File "...\lib\site-packages\spacy\training\corpus.py", line 131, in __call__
    for real_eg in examples:
  File "...\lib\site-packages\spacy\training\corpus.py", line 153, in make_examples
    for reference in reference_docs:
  File "...\lib\site-packages\spacy\training\corpus.py", line 188, in read_docbin
    for doc in docs:
  File "...\lib\site-packages\spacy\tokens\_serialize.py", line 135, in get_docs
    doc.user_data.update(user_data)
TypeError: 'NoneType' object is not iterable
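As a side note (not part of the original report), the traceback ends inside DocBin.get_docs, so the same error can be triggered without spacy train by reading one of the serialized files back directly:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
with open("train.spacy", "rb") as f:
    doc_bin = DocBin().from_bytes(f.read())
# Raises TypeError: 'NoneType' object is not iterable, because the stored
# user_data deserializes to None and dict.update(None) fails in get_docs.
docs = list(doc_bin.get_docs(nlp.vocab))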

test.cfg:

[paths]
train = "train.spacy"
dev = "dev.spacy"
raw = null
init_tok2vec = null
vectors = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[components]

[components.textcat]
factory = "textcat"
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.textcat.model.tok2vec.encode:width}
rows = [10000,5000,5000,5000]
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
gold_preproc = ${corpora.train.gold_preproc}
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
accumulate_gradient = 1
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 32
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
eps = 0.00000001
learn_rate = 0.001
use_averages = true

[training.score_weights]
cats_score_desc = null
cats_p = null
cats_r = null
cats_f = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
cats_score = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null

[initialize.components]

[initialize.tokenizer]

Environment

  • spaCy version: 3.0.0rc1
  • Platform: Windows-10-10.0.17763-SP0
  • Python version: 3.8.5
  • Pipelines: en_core_web_trf (3.0.0a0)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
adrianeboyd commented, Oct 23, 2020

But do let us know if it’s getting set to None somewhere with spacy itself by a built-in component, since that would definitely be a bug…
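A simple way to stay clear of the problem when building training data (a workaround suggestion, not an official recommendation from the thread) is to clear custom state with an empty dict rather than None, reusing the texts and nlp from the reproduction script above:

# Safer variant of the reproduction loop: empty dict instead of None.
docs = []
for i, text in enumerate(texts):
    doc = nlp.make_doc(text)
    doc.cats = {"label": i % 2}
    doc.user_data = {}  # keep user_data a dict so serialization round-trips
    docs.append(doc)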

0 reactions
github-actions[bot] commented, Oct 30, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


