
Issue resuming training on transformer-based NER


I’m using the spaCy nightly version. I have successfully trained a transformer-based NER model and saved it; now I’m trying to resume training on it.

First, I’m not sure whether I have set up the config file correctly; the relevant part looks like this:

[components]

[components.ner]
# This is the path to my trained model
source = "best-model"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}

[components.transformer]
# This is the path to my trained model
source = "best-model"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "dccuchile/bert-base-spanish-wwm-uncased"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true
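
As a sanity check, it can help to confirm that the sourced pipeline actually loads before kicking off training. A minimal sketch, assuming "best-model" is the directory the earlier run saved to:

import spacy

# Load the previously trained pipeline from disk.
nlp = spacy.load("best-model")

# Both components we want to source should be present.
print(nlp.pipe_names)  # expected: ['transformer', 'ner']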

Now, when I try to train like this:

!python -m spacy train 'config.cfg' --output='model_t' --gpu-id=0 --paths.train train.spacy --paths.dev test.spacy

I’m getting this error message:

2020-10-29 14:36:11.541313: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['transformer', 'ner']
Resuming training for: ['ner', 'transformer']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: []
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
⚠ Aborting and saving the final best model. Encountered exception: CUDA
out of memory. Tried to allocate 94.00 MiB (GPU 0; 15.75 GiB total capacity;
13.81 GiB already allocated; 78.88 MiB free; 14.34 GiB reserved in total by
PyTorch)
✔ Saved pipeline to output directory
model_t2/model-last
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/_util.py", line 65, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 105, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 85, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 201, in train_while_improving
    score, other_scores = evaluate()
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 253, in evaluate
    scores = nlp.evaluate(dev_examples)
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 1312, in evaluate
    docs = list(docs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1363, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 170, in pipe
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1322, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1363, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/pipeline_component.py", line 173, in pipe
    self.set_annotations(subbatch, self.predict(subbatch))
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/pipeline_component.py", line 189, in predict
    activations = self.model.predict(docs)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 312, in predict
    return self._func(self, X, is_train=False)[0]
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/layers/transformer_model.py", line 111, in forward
    tensors, bp_tensors = transformer(token_data, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 288, in __call__
    return self._func(self, X, is_train=is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/layers/pytorchwrapper.py", line 79, in forward
    Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 29, in __call__
    return self.predict(inputs), lambda a: ...
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 38, in predict
    outputs = self._model(*inputs.args, **inputs.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 388, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 333, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1369, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 15.75 GiB total capacity; 13.81 GiB already allocated; 78.88 MiB free; 14.34 GiB reserved in total by PyTorch)

I understand the message is telling me I’m out of memory, but it seems odd that I can train from scratch with no issues yet hit this error when resuming training on the saved model. Any help is appreciated.
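
One generic way to see where the GPU memory is going (a standard PyTorch diagnostic, not something from this thread) is to print the allocator stats right before evaluation runs:

import torch

# These counters correspond to the "allocated" and "reserved" figures
# in the CUDA OOM message above.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")

# Releases cached blocks back to the driver; useful between runs,
# though it won't fix a batch that is simply too large.
torch.cuda.empty_cache()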

Your Environment

  • spaCy version: 3.0.0rc2
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • Pipelines: es_core_news_md (3.0.0a0), es_dep_news_trf (3.0.0a0)


Top GitHub Comments

1 reaction
adrianeboyd commented, Dec 8, 2020

It isn’t just for timing purposes because you’re not actually running the final component (which is the NER model you’re trying to train) unless you iterate over that generator. (Earlier versions had the scorer iterate over this generator, and the overall goal here was to separate the pipeline timing from the scorer timing.) I think the previous version was still a bit clunky so I’ve reworked it a bit more. Can you try the updated version here? https://github.com/explosion/spaCy/pull/6386
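
To illustrate the laziness the comment above relies on: a generator-based pipeline does nothing until something iterates over it, so the final component may never run at all. A minimal, self-contained Python sketch:

def component(docs):
    # Stands in for a pipeline component such as the NER model.
    for doc in docs:
        print("component ran on", doc)
        yield doc

gen = component(["doc1", "doc2"])  # nothing printed yet: generators are lazy
results = list(gen)                # only now does the component actually run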

Looking at this again, I think the problem might actually be that the default batch_size (256) is too high for a GPU if you have some longer dev docs. We’ve trained a fair number of models internally, but we don’t have many docs that are over a paragraph or so long. How many dev docs were you using? Were any particularly long? Using my updated PR, is it better if you manually lower the default batch_size in the evaluate() kwargs?
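
For reference, lowering the evaluation batch size manually looks roughly like this outside the training loop; the paths and the value 32 here are illustrative assumptions, not values from this issue:

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Hypothetical paths; substitute your own pipeline and dev set.
nlp = spacy.load("model_t/model-last")
doc_bin = DocBin().from_disk("test.spacy")

# Rebuild Example objects from the serialized dev docs.
examples = [Example(nlp.make_doc(gold.text), gold)
            for gold in doc_bin.get_docs(nlp.vocab)]

# Language.evaluate accepts a batch_size kwarg; a smaller value trades
# evaluation speed for a lower peak GPU memory footprint.
scores = nlp.evaluate(examples, batch_size=32)
print(scores)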

We’re also running into some memory issues internally on CPU (for xx models that we haven’t published yet) either due to large training corpora or long dev docs, so I’ll be looking into a few spots where we can improve the memory usage in the near future.

Since this is something that may need to be adjusted and have different defaults for CPU vs. GPU, I think we’ll most likely need a way to specify the batch size for evaluate from the config, but I’m not sure exactly how yet. We may need to add a training parameter like eval_batch_size? We’ll have to discuss what makes sense…
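
If a parameter like that were added, it would presumably live in the [training] block of the config. Purely as a hypothetical sketch of the suggestion above (this setting did not exist at the time of writing):

[training]
# Hypothetical knob, mirroring the eval_batch_size idea floated above;
# not part of the released config schema.
eval_batch_size = 32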

(And I still don’t know what’s going on with the differences between training from scratch and resuming.)

1 reaction
fcggamou commented, Nov 12, 2020

Great, thanks a lot for the workaround! I will test this and post an update.
