
Issue resuming training on transformer-based NER


I’m using the spaCy nightly version. I have successfully trained a transformer-based NER model and saved it; now I’m trying to resume training on it.

First, I’m not sure whether I have set up the config file correctly; the relevant part looks like this:

[components]

[components.ner]
# This is the path to my trained model
source = "best-model"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v1"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}

[components.transformer]
# This is the path to my trained model
source = "best-model"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "dccuchile/bert-base-spanish-wwm-uncased"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true
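
As a sanity check, it can help to confirm that the sourced pipeline actually loads before kicking off training. A minimal sketch, assuming "best-model" is the directory the earlier run saved to:

import spacy

# Load the previously trained pipeline from disk.
nlp = spacy.load("best-model")

# Both components we want to source should be present.
print(nlp.pipe_names)  # expected: ['transformer', 'ner']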

Now, when I try to train like this:

!python -m spacy train 'config.cfg' --output='model_t' --gpu-id=0 --paths.train train.spacy --paths.dev test.spacy

I’m getting this error message:

2020-10-29 14:36:11.541313: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
Set up nlp object from config
Pipeline: ['transformer', 'ner']
Resuming training for: ['ner', 'transformer']
Created vocabulary
Finished initializing nlp object
Initialized pipeline components: []
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
⚠ Aborting and saving the final best model. Encountered exception: CUDA
out of memory. Tried to allocate 94.00 MiB (GPU 0; 15.75 GiB total capacity;
13.81 GiB already allocated; 78.88 MiB free; 14.34 GiB reserved in total by
PyTorch)
✔ Saved pipeline to output directory
model_t2/model-last
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/_util.py", line 65, in setup_cli
    command(prog_name=COMMAND)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 105, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 85, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 201, in train_while_improving
    score, other_scores = evaluate()
  File "/usr/local/lib/python3.6/dist-packages/spacy/training/loop.py", line 253, in evaluate
    scores = nlp.evaluate(dev_examples)
  File "/usr/local/lib/python3.6/dist-packages/spacy/language.py", line 1312, in evaluate
    docs = list(docs)
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1363, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 170, in pipe
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1322, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/usr/local/lib/python3.6/dist-packages/spacy/util.py", line 1363, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/pipeline_component.py", line 173, in pipe
    self.set_annotations(subbatch, self.predict(subbatch))
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/pipeline_component.py", line 189, in predict
    activations = self.model.predict(docs)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 312, in predict
    return self._func(self, X, is_train=False)[0]
  File "/usr/local/lib/python3.6/dist-packages/spacy_transformers/layers/transformer_model.py", line 111, in forward
    tensors, bp_tensors = transformer(token_data, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/model.py", line 288, in __call__
    return self._func(self, X, is_train=is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/layers/pytorchwrapper.py", line 79, in forward
    Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 29, in __call__
    return self.predict(inputs), lambda a: ...
  File "/usr/local/lib/python3.6/dist-packages/thinc/shims/pytorch.py", line 38, in predict
    outputs = self._model(*inputs.args, **inputs.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 388, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 333, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1369, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 15.75 GiB total capacity; 13.81 GiB already allocated; 78.88 MiB free; 14.34 GiB reserved in total by PyTorch)

I understand the message is telling me I’m out of memory, but it seems odd that I can train from scratch with no issues yet hit this error when resuming training on the saved model. Any help is appreciated.
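
One generic way to see where the GPU memory is going (a standard PyTorch diagnostic, not something from this thread) is to print the allocator stats right before evaluation runs:

import torch

# These counters correspond to the "allocated" and "reserved" figures
# in the CUDA OOM message above.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MiB")

# Releases cached blocks back to the driver; useful between runs,
# though it won't fix a batch that is simply too large.
torch.cuda.empty_cache()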

Your Environment

  • spaCy version: 3.0.0rc2
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • Pipelines: es_core_news_md (3.0.0a0), es_dep_news_trf (3.0.0a0)


Top GitHub Comments

1 reaction
adrianeboyd commented, Dec 8, 2020

It isn’t just for timing purposes because you’re not actually running the final component (which is the NER model you’re trying to train) unless you iterate over that generator. (Earlier versions had the scorer iterate over this generator, and the overall goal here was to separate the pipeline timing from the scorer timing.) I think the previous version was still a bit clunky so I’ve reworked it a bit more. Can you try the updated version here? https://github.com/explosion/spaCy/pull/6386
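
To illustrate the laziness the comment above relies on: a generator-based pipeline does nothing until something iterates over it, so the final component may never run at all. A minimal, self-contained Python sketch:

def component(docs):
    # Stands in for a pipeline component such as the NER model.
    for doc in docs:
        print("component ran on", doc)
        yield doc

gen = component(["doc1", "doc2"])  # nothing printed yet: generators are lazy
results = list(gen)                # only now does the component actually run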

Looking at this again, I think the problem might actually be that the default batch_size (256) is too high for a GPU if you have some longer dev docs. We’ve trained a fair number of models internally, but we don’t have many docs that are over a paragraph or so long. How many dev docs were you using? Were any particularly long? Using my updated PR, is it better if you manually lower the default batch_size in the evaluate() kwargs?
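
For reference, lowering the evaluation batch size manually looks roughly like this outside the training loop; the paths and the value 32 here are illustrative assumptions, not values from this issue:

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Hypothetical paths; substitute your own pipeline and dev set.
nlp = spacy.load("model_t/model-last")
doc_bin = DocBin().from_disk("test.spacy")

# Rebuild Example objects from the serialized dev docs.
examples = [Example(nlp.make_doc(gold.text), gold)
            for gold in doc_bin.get_docs(nlp.vocab)]

# Language.evaluate accepts a batch_size kwarg; a smaller value trades
# evaluation speed for a lower peak GPU memory footprint.
scores = nlp.evaluate(examples, batch_size=32)
print(scores)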

We’re also running into some memory issues internally on CPU (for xx models that we haven’t published yet) either due to large training corpora or long dev docs, so I’ll be looking into a few spots where we can improve the memory usage in the near future.

Since this is something that may need to be adjusted and have different defaults for CPU vs. GPU, I think we’ll most likely need a way to specify the batch size for evaluate from the config, but I’m not sure exactly how yet. We may need to add a training parameter like eval_batch_size? We’ll have to discuss what makes sense…
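
If a parameter like that were added, it would presumably live in the [training] block of the config. Purely as a hypothetical sketch of the suggestion above (this setting did not exist at the time of writing):

[training]
# Hypothetical knob, mirroring the eval_batch_size idea floated above;
# not part of the released config schema.
eval_batch_size = 32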

(And I still don’t know what’s going on with the differences between training from scratch and resuming.)

1 reaction
fcggamou commented, Nov 12, 2020

Great, thanks a lot for the workaround! I will test this and post an update.
