Errors when updating already trained TextCategorizer
How to reproduce the behaviour
pip install spacy[cuda110] scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
This week I successfully trained a TextCategorizer on GPU. The following is the gist of the training code:
spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # if base
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()
    ...
    for i in range(epochs):
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
    # store final model
    with nlp.use_params(optimizer.averages):
        nlp.to_disk(output_dir)
The model was saved to the directory spacy_ps_20201216. Inside that directory are the following artifacts: meta.json, textcat, tokenizer, vocab.
Now, I want to load that same model and continue training.
From my understanding, nlp.resume_training() should do the trick, but I'm having issues either loading or updating the textcat to continue such training.
Here are the multiple ways I’ve attempted loading the textcat model, and their respective errors:
base_model = 'spacy_ps_20201216'
nlp = spacy.load(base_model)
Note: This works until the script reaches the update step:
Traceback (most recent call last):
File "spacy_textcat.py", line 310, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 295, in main
train_textcat(nlp,
File "spacy_textcat.py", line 186, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
res, bp_res = func.begin_update((X, lengths))
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
output = ops.mean_pool(X, lengths)
File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'
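The TypeError is telling: thinc's NumpyOps.mean_pool (a CPU implementation) is being handed a cupy (GPU) array, so the deserialized textcat seems to be running with CPU ops while the data lives on the GPU. A quick sketch to inspect which backend the loaded component ended up with (this assumes thinc v7's Model.ops attribute, which spaCy 2.x pipeline models expose):

import spacy

nlp = spacy.load('spacy_ps_20201216')
textcat = nlp.get_pipe('textcat')
# "NumpyOps" here would mean the weights were deserialized for CPU,
# which cannot consume cupy arrays produced on the GPU
print(type(textcat.model.ops).__name__)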
I then tried loading the initial model and then adding the trained textcat from the base_model:
nlp = spacy.load('en_core_sci_lg')
nlp.add_pipe(spacy.load(base_model).get_pipe('textcat'))
Note: This works on loading, but when the update line is reached, I receive a familiar error:
Traceback (most recent call last):
File "spacy_textcat.py", line 307, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 292, in main
train_textcat(nlp,
File "spacy_textcat.py", line 183, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
res, bp_res = func.begin_update((X, lengths))
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
output = ops.mean_pool(X, lengths)
File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'
I then hopped out of the script to try some other ways:
>>> base_model = 'spacy_ps_20201216/textcat'
>>> nlp = spacy.load('en_core_sci_lg')
>>> textcat = spacy.pipeline.TextCategorizer(nlp.vocab)
>>> textcat.from_disk(base_model)
ValueError: Can't read file: spacy_ps_20201216/textcat/vocab/strings.json
I then moved spacy_ps_20201216/vocab into spacy_ps_20201216/textcat and reran the snippet above successfully.
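For reproducibility, the move expressed as a Python sketch (I did the move by hand; shutil is just one way to script it):

import shutil

# place the vocab where TextCategorizer.from_disk went looking for it
shutil.move('spacy_ps_20201216/vocab', 'spacy_ps_20201216/textcat/vocab')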
I updated my script to use that layout and received the following error when updating:
Traceback (most recent call last):
File "spacy_textcat.py", line 312, in <module>
plac.call(main)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "spacy_textcat.py", line 297, in main
train_textcat(nlp,
File "spacy_textcat.py", line 188, in train_textcat
nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 1016, in spacy.pipeline.pipes.TextCategorizer.update
File "pipes.pyx", line 88, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'textcat' not initialized. Did you forget to load a model, or forget to call begin_training()?
Back to the terminal with base_model = 'spacy_ps_20201216':
>>> from spacy.lang.en import English
>>> nlp = English().from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
[]
>>> nlp = spacy.load('en_core_sci_lg')
>>> nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
<spacy.lang.en.English object at 0x7f9a02cd8e50>
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
>>> nlp = nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
I would appreciate any pointers on the right way to load a trained TextCategorizer and continue training.
Your Environment
- spaCy version: 2.3.5
- Platform: Linux-4.19.0-12-cloud-amd64-x86_64-with-glibc2.10
- Python version: 3.8.6
- Operating System (GCP Image): pytorch-1-6-cu110-notebooks-v20201105-debian-10
Top GitHub Comments
Cool, I was able to reproduce your problem with that last script. And the good news is the fix should be pretty easy: just put spacy.prefer_gpu() all the way at the top of your script, before you load the nlp model from file 😃
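For clarity, a minimal sketch of the suggested ordering (model path taken from the report above; training loop omitted):

import spacy

spacy.prefer_gpu()  # must run before any model is loaded from disk

nlp = spacy.load('spacy_ps_20201216')  # now deserializes with GPU ops
optimizer = nlp.resume_training()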
Totally understood, my apologies! I’ve trimmed the script to the code below. Can confirm that if I remove spacy.prefer_gpu(), the code runs.