Errors when updating already trained TextCategorizer

How to reproduce the behaviour

pip install spacy[cuda110] scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz

This week I successfully trained a TextCategorizer on a GPU.

The following is the gist of the training code:

spacy.prefer_gpu()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # continue from a previously trained textcat if one is available
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()
    
    ...

    for i in range(epochs):
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

    # store final model
    with nlp.use_params(optimizer.averages):
        nlp.to_disk(output_dir)

The model was saved to the directory spacy_ps_20201216. That directory contains the following artifacts: meta.json, textcat, tokenizer, vocab.

Now I want to load that same model and continue training. From my understanding, nlp.resume_training() should do the trick, but I’m having trouble loading or updating the textcat to continue that training.

Here are the ways I’ve tried to load the textcat model, with the error each one produced:

base_model = 'spacy_ps_20201216'
nlp = spacy.load(base_model)

Note: This works until the script reaches the update step, which fails with:

Traceback (most recent call last):
  File "spacy_textcat.py", line 310, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 295, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 186, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
    res, bp_res = func.begin_update((X, lengths))
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
    output = ops.mean_pool(X, lengths)
  File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'

Next, I tried loading the initial model and adding the trained textcat from base_model to it.

nlp = spacy.load('en_core_sci_lg')
nlp.add_pipe(spacy.load(base_model).get_pipe('textcat'))

Note: This loads without problems, but when the update line is reached I get the same error as before.

Traceback (most recent call last):
  File "spacy_textcat.py", line 307, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 292, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 183, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1020, in spacy.pipeline.pipes.TextCategorizer.update
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 30, in begin_update
    res, bp_res = func.begin_update((X, lengths))
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/thinc/neural/pooling.py", line 53, in mean_pool
    output = ops.mean_pool(X, lengths)
  File "thinc/neural/ops.pyx", line 733, in thinc.neural.ops.NumpyOps.mean_pool
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'cupy.core.core.ndarray'

I then hopped out of the script to try some other ways:

>>> base_model = 'spacy_ps_20201216/textcat'
>>> nlp = spacy.load('en_core_sci_lg')
>>> textcat = spacy.pipeline.TextCategorizer(nlp.vocab)
>>> textcat.from_disk(base_model)
ValueError: Can't read file: spacy_ps_20201216/textcat/vocab/strings.json

I then moved spacy_ps_20201216/vocab into spacy_ps_20201216/textcat and reran the commands above successfully. I updated my script to include that change and got the following error when updating:

Traceback (most recent call last):
  File "spacy_textcat.py", line 312, in <module>
    plac.call(main)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "spacy_textcat.py", line 297, in main
    train_textcat(nlp,
  File "spacy_textcat.py", line 188, in train_textcat
    nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
  File "/opt/conda/envs/path-cancer-spacy/lib/python3.8/site-packages/spacy/language.py", line 529, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 1016, in spacy.pipeline.pipes.TextCategorizer.update
  File "pipes.pyx", line 88, in spacy.pipeline.pipes.Pipe.require_model
ValueError: [E109] Model for component 'textcat' not initialized. Did you forget to load a model, or forget to call begin_training()?

Back to the terminal with base_model='spacy_ps_20201216':

>>> from spacy.lang.en import English
>>> nlp = English().from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
[]

>>> nlp = spacy.load('en_core_sci_lg')
>>> nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
<spacy.lang.en.English object at 0x7f9a02cd8e50>
>>> nlp.pipe_names
['tagger', 'parser', 'ner']

>>> nlp = nlp.from_disk(base_model,exclude=["parser","tagger", "ner"])
>>> nlp.pipe_names
['tagger', 'parser', 'ner']

I would appreciate any pointers on the right way to load a trained TextCategorizer and continue training.

Your Environment

spaCy version: 2.3.5
Platform: Linux-4.19.0-12-cloud-amd64-x86_64-with-glibc2.10
Python version: 3.8.6
Operating System (GCP Image): pytorch-1-6-cu110-notebooks-v20201105-debian-10

Top GitHub Comments

svlandeg commented, Dec 27, 2020

Cool, I was able to reproduce your problem with that last script. And the good news is the fix should be pretty easy: just put spacy.prefer_gpu() all the way at the top of your script, before you load the nlp model from file 😃
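
A minimal sketch of that ordering, for illustration only. It assumes the spaCy 2.x API used in this thread (spacy.prefer_gpu() and spacy.load()). The helper name load_with_gpu_first and its spacy_module parameter are hypothetical and not part of spaCy; the parameter exists only so the call order can be checked on a machine without spaCy or a GPU.

```python
import importlib


def load_with_gpu_first(model_path, spacy_module=None):
    """Select the device first, then load the pipeline from disk.

    `spacy_module` is a hypothetical hook for passing in a stand-in module;
    by default the real `spacy` package is imported.
    """
    if spacy_module is None:
        spacy_module = importlib.import_module("spacy")

    # prefer_gpu() returns True if a GPU was activated, False otherwise.
    using_gpu = spacy_module.prefer_gpu()

    # Load only after the device has been chosen, as suggested above.
    nlp = spacy_module.load(model_path)
    return nlp, using_gpu
```

In the trimmed script, this would correspond to calling nlp, using_gpu = load_with_gpu_first('./tmp/spacy_model') in place of the original spacy.load(...) line and dropping the later spacy.prefer_gpu() call in the training section.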

jgieringer commented, Dec 22, 2020

Totally understood, my apologies!

I’ve trimmed the script to the code below. Can confirm that if I remove spacy.prefer_gpu(), the code runs.

import scispacy
import spacy
import random
from pathlib import Path
from spacy.util import minibatch, compounding, decaying


## setup nlp & data

nlp = spacy.load('./tmp/spacy_model') # en_core_sci_lg or ./tmp/spacy_model
continue_training = 'textcat' in nlp.pipe_names

train_data = [
    ('Innovation in Database Management: Computer Science vs. Engineering',
     {'cats': {'SIGGRAPH': False,
               'VLDB': True,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('High performance prime field multiplication for GPU',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': True,
               'INFOCOM': False}}),
    ('enchanted scissors: a scissor interface for support in cutting and interactive fabrication',
     {'cats': {'SIGGRAPH': True,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': False}}),
    ('Detection of channel degradation attack by Intermediary Node in Linear Networks',
     {'cats': {'SIGGRAPH': False,
               'VLDB': False,
               'ISCAS': False,
               'INFOCOM': True}})
]

train_data = [(nlp.tokenizer(txt), cats) for txt,cats in train_data]
model_labels = set(train_data[0][1]['cats'].keys())


## textcat setup

if not continue_training:
    textcat = nlp.create_pipe(
        'textcat',
        config={
            'exclusive_classes': True,
            'architecture': 'simple_cnn',
        }
    )
    nlp.add_pipe(textcat, last=True)

# store model labels in textcategorizer if they don't already exist
current_labels = set(nlp.get_pipe('textcat').labels)
for l in model_labels:
    if l not in current_labels:
        nlp.get_pipe('textcat').add_label(str(l))

        
## train

epochs = 1
textcat = nlp.get_pipe("textcat")
spacy.prefer_gpu()
        
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    # continue from a previously trained textcat if one is available
    if continue_training:
        # Start with an existing model, use default optimizer
        optimizer = nlp.resume_training()
    else:
        optimizer = nlp.begin_training()

    # create batch sizes
    min_batch_size, max_batch_size, update_by = (1.,64.,1.001)
    batch_sizes = compounding(min_batch_size, max_batch_size, update_by)

    # create decaying dropout
    starting_dropout, ending_dropout, decay_rate = (0.6, 0.2, 1e-4)
    dropouts = decaying(starting_dropout, ending_dropout, decay_rate)

    for i in range(epochs):
        losses = {}

        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch, dropout in zip(batches, dropouts):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)
        
        # export model
        outdir = Path('./tmp/spacy_model')
        if not outdir.exists():
            outdir.mkdir(parents=True)
        with nlp.use_params(optimizer.averages):
            nlp.to_disk(outdir)