TypeError deserializing a DocBin when user_data is None
I’ve been trying out 3.0.0rc1, enjoying the new config files, and found a possibly unexpected behavior. If the user_data attribute of a Doc instance is None, and you serialize it to a DocBin with store_user_data=True, then using the DocBin in training with spacy train causes a TypeError.
[...]
File "...\lib\site-packages\spacy\training\corpus.py", line 153, in make_examples
for reference in reference_docs:
File "...\lib\site-packages\spacy\training\corpus.py", line 188, in read_docbin
for doc in docs:
File "...\lib\site-packages\spacy\tokens\_serialize.py", line 135, in get_docs
doc.user_data.update(user_data)
TypeError: 'NoneType' object is not iterable
OK, figuring out what went wrong here is straightforward: doc.user_data really shouldn’t be None. Deserializing the DocBin might be just one place where this would be a problem. Should there be a type check anyway?
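The error message suggests that the failing call is an ordinary dict.update that receives None: the freshly created Doc has a dict as user_data, while the value read back from the DocBin is None. Below is a minimal sketch of that failure and of the kind of guard the question has in mind. It uses plain dicts rather than spaCy objects, and merge_user_data is a hypothetical helper, not part of spaCy's API.

```python
def merge_user_data(target, stored):
    """Merge deserialized user data into target, tolerating None.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    if stored is None:
        # None was stored instead of a dict: leave target untouched.
        return target
    if not isinstance(stored, dict):
        raise TypeError(
            f"user_data must be a dict or None, got {type(stored).__name__}"
        )
    target.update(stored)
    return target


# The unguarded call raises a TypeError like the one in the traceback.
try:
    {}.update(None)
except TypeError as err:
    unguarded_error = str(err)

# The guarded version ignores None and merges real dicts.
merged_none = merge_user_data({"a": 1}, None)
merged_dict = merge_user_data({"a": 1}, {"b": 2})
```

Whether to silently skip None or raise a clearer error is a design choice; the sketch skips it so that loading keeps working.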
Reproducing
Set the user_data attribute of a Doc to None before serializing to a DocBin.
import spacy
from spacy.tokens import DocBin

texts = [
    "Class for managing annotated corpora for training and evaluation data.",
    "Storage for entities and aliases of a knowledge base for entity linking.",
    "Container for convenient access to large lookup tables and dictionaries.",
    "Store morphological analyses and map them to and from hash values.",
]

nlp = spacy.load("en_core_web_trf")
docs = []
for i, text in enumerate(texts):
    doc = nlp.make_doc(text)
    doc.cats = {"label": i % 2}
    doc.user_data = None  # oops
    docs.append(doc)

# First two docs go to the training set, the last two to the dev set.
with open("train.spacy", "wb") as f:
    f.write(DocBin(docs=docs[:2], store_user_data=True).to_bytes())
with open("dev.spacy", "wb") as f:
    f.write(DocBin(docs=docs[2:], store_user_data=True).to_bytes())
Run spacy train (test.cfg included below).
python -m spacy train test.cfg
Full traceback:
ℹ Using CPU
=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['textcat']
Created vocabulary
Finished initializing nlp object
Traceback (most recent call last):
File "...\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "...\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "...\lib\site-packages\spacy\__main__.py", line 4, in <module>
setup_cli()
File "...\lib\site-packages\spacy\cli\_util.py", line 65, in setup_cli
command(prog_name=COMMAND)
File "...\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "...\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "...\lib\site-packages\click\core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "...\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "...\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "...\lib\site-packages\typer\main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "...\lib\site-packages\spacy\cli\train.py", line 56, in train_cli
nlp = init_nlp(config, use_gpu=use_gpu)
File "...\lib\site-packages\spacy\training\initialize.py", line 51, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "...\lib\site-packages\spacy\language.py", line 1232, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "...\lib\site-packages\spacy\pipeline\textcat.py", line 326, in initialize
validate_get_examples(get_examples, "TextCategorizer.initialize")
File "spacy\training\example.pyx", line 62, in spacy.training.example.validate_get_examples
File "spacy\training\example.pyx", line 41, in spacy.training.example.validate_examples
File "...\lib\site-packages\spacy\training\corpus.py", line 131, in __call__
for real_eg in examples:
File "...\lib\site-packages\spacy\training\corpus.py", line 153, in make_examples
for reference in reference_docs:
File "...\lib\site-packages\spacy\training\corpus.py", line 188, in read_docbin
for doc in docs:
File "...\lib\site-packages\spacy\tokens\_serialize.py", line 135, in get_docs
doc.user_data.update(user_data)
TypeError: 'NoneType' object is not iterable
test.cfg:
[paths]
train = "train.spacy"
dev = "dev.spacy"
raw = null
init_tok2vec = null
vectors = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["textcat"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
[components]
[components.textcat]
factory = "textcat"
threshold = 0.5
[components.textcat.model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
nO = null
[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = ${components.textcat.model.tok2vec.encode:width}
rows = [10000,5000,5000,5000]
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false
[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths:dev}
gold_preproc = ${corpora.train.gold_preproc}
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths:train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.2
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
accumulate_gradient = 1
frozen_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 32
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
eps = 0.00000001
learn_rate = 0.001
use_averages = true
[training.score_weights]
cats_score_desc = null
cats_p = null
cats_r = null
cats_f = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
cats_score = 1.0
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
[initialize.components]
[initialize.tokenizer]
environment
- spaCy version: 3.0.0rc1
- Platform: Windows-10-10.0.17763-SP0
- Python version: 3.8.5
- Pipelines: en_core_web_trf (3.0.0a0)
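As a workaround on the serializing side, user_data could be normalized to an empty dict before the DocBin is built. The following is a sketch under that assumption; ensure_user_data is a hypothetical helper, and SimpleNamespace objects stand in for Doc instances so the example runs without spaCy installed.

```python
from types import SimpleNamespace


def ensure_user_data(docs):
    """Replace a None user_data with an empty dict on each doc, in place.

    Hypothetical helper; works on any object with a user_data attribute.
    """
    for doc in docs:
        if doc.user_data is None:
            doc.user_data = {}
    return docs


# Stand-ins for spaCy Doc objects (an assumption made for illustration).
docs = [SimpleNamespace(user_data=None), SimpleNamespace(user_data={"k": "v"})]
ensure_user_data(docs)
```

With real Doc objects, the helper would be called on the list of docs just before constructing DocBin(docs=..., store_user_data=True).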
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
But do let us know if it’s getting set to None somewhere within spaCy itself by a built-in component, since that would definitely be a bug…
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.