Retrain JUST the NER component to have character CNN features?

Recent work (Gu et al., 2020, “PubMedBERT”; Veysel and Talby, 2020) suggests that out-of-vocabulary tokens (OOVs, in the broad sense of anything missing from your word-level vocabulary) hurt NER performance, at least in biomedical domains such as gene and protein tagging. The second paper uses a character CNN feeding an LSTM and outperforms the state of the art, including Stanza and BERT-derived models such as PubMedBERT.
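To make the OOV point concrete, here is a minimal sketch of the general idea, not the architecture from either paper and not spaCy's implementation. A word-level lookup collapses every unseen word to one unknown id, while a character CNN still produces a distinct feature vector per word. All names, sizes, and the random weights are assumptions chosen for illustration.

```python
import random

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-"
CHAR_DIM = 4    # size of each character embedding (arbitrary)
N_FILTERS = 6   # number of convolution filters (arbitrary)
WINDOW = 3      # each filter spans 3 consecutive characters

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Untrained, random parameters: one embedding per known character,
# one shared embedding for any other character, and the filter weights.
char_emb = {c: [rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for c in CHARS}
unk_char = [rng.uniform(-1, 1) for _ in range(CHAR_DIM)]
filters = [[rng.uniform(-1, 1) for _ in range(WINDOW * CHAR_DIM)]
           for _ in range(N_FILTERS)]

# Toy word-level vocabulary: anything not listed maps to <unk>.
word_vocab = {"<unk>": 0, "protein": 1, "gene": 2}


def word_id(word):
    """Word-level lookup: every OOV word gets the same id."""
    return word_vocab.get(word.lower(), word_vocab["<unk>"])


def char_cnn_features(word):
    """Slide each filter over character windows and max-pool over positions."""
    pad = [0.0] * CHAR_DIM
    vecs = [char_emb.get(c, unk_char) for c in word.lower()]
    vecs = [pad] * (WINDOW - 1) + vecs + [pad] * (WINDOW - 1)
    features = []
    for weights in filters:
        best = float("-inf")
        for i in range(len(vecs) - WINDOW + 1):
            # Flatten the window of character vectors into one list.
            window = [x for v in vecs[i:i + WINDOW] for x in v]
            best = max(best, sum(w * x for w, x in zip(weights, window)))
        features.append(best)
    return features


# Both gene names are OOV for the word vocabulary...
print(word_id("BRCA1"), word_id("p53"))
# ...but each still gets its own character-level feature vector.
print(char_cnn_features("BRCA1"))
print(char_cnn_features("p53"))
```

In a real tagger these character features would be trained jointly with the rest of the network and then passed to the sequence encoder (the LSTM, in the second paper above).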

With that in mind, I tried to retrain just the NER model on top of the ScispaCy^^ base model (en_core_sci_md-0.3.0), and I got the error below when I turned on the --chr/--use-chars flag.

The training command:

$ python -m spacy train en /path/to/output/dir /path/to/train.json /path/to/dev.json --base-model 'en_core_sci_md' --pipeline ner -R -v 'en_core_sci_md' -ne 2 --meta-path /path/to/model/en_core_sci_md/meta.json --chr

This gives the following error. (Sorry I can’t copy the whole thing; I’m visually copying and typing from screen to screen at the moment.)

...
ValueError [E149] Error deserializing model. Check that the config used to create the component matches the model being loaded.
...

This happened when the whole model (parser, tagger, NER, etc.) was reloaded, presumably from the ground up, to run on the validation/dev set after the first iteration.

Note that the model trains fine for several iterations and saves a working model if I omit the --chr/--use-chars flag. This is presumably because the components have tied parameters, and the Tok2Vec features for the other parts of the pipeline (tagger and parser) have no char CNN component. I don’t want to retrain the parser and tagger to use char CNN features, so is there a workaround?
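For illustration only, here is a toy sketch of the kind of mismatch the E149 message describes: weights saved from a component built with one config cannot be loaded into a component rebuilt with another. This is not spaCy's serialization code, and nothing in this thread confirms that it is the actual cause in spaCy 2.3.2. Every function name, config key, and width below is hypothetical.

```python
def build_component(config):
    """Allocate zero-initialised weights whose size depends on the config."""
    width = config["token_width"]
    if config.get("use_chars"):
        width += config["char_width"]  # character features widen the input
    return {"config": dict(config), "weights": [0.0] * width}


def save(component):
    """Hypothetical serialiser: stores the weights but not the config."""
    return list(component["weights"])


def load(config, saved_weights):
    """Rebuild the component from `config`, then copy the saved weights in."""
    component = build_component(config)
    if len(saved_weights) != len(component["weights"]):
        raise ValueError(
            "Error deserializing model: expected %d weights, got %d"
            % (len(component["weights"]), len(saved_weights))
        )
    component["weights"] = list(saved_weights)
    return component


with_chars = {"token_width": 96, "use_chars": True, "char_width": 32}
without_chars = {"token_width": 96, "use_chars": False}

blob = save(build_component(with_chars))

# Reloading with the config used at training time works.
load(with_chars, blob)

# Reloading with a config that lacks the character features fails.
try:
    load(without_chars, blob)
except ValueError as err:
    print(err)
```

If something like this is happening, the thing to check would be whether the config stored with the saved model records the character-embedding setting that was active during training.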

(^^Perhaps I should crosspost there, but this seems to be a spaCy issue – something about model component mismatch.)

Your Environment

  • Operating System: macOS High Sierra 10.13.6 (I know, out of date, but probably not relevant!)
  • Python Version Used: Python 3.7.9
  • spaCy Version Used: 2.3.2 (compatible with ScispaCy model)
  • Environment Information: BASH (?)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (9 by maintainers)

Top GitHub Comments

adrianeboyd commented, Nov 30, 2020 (1 reaction)

https://nightly.spacy.io/api/architectures#CharacterEmbed

https://nightly.spacy.io/usage/layers-architectures#sublayers

(Hmm, there’s really no good way to search the nightly docs, that doesn’t make things easy to find…)
