Retrain JUST the NER component to have character CNN features?

Recent work (Gu et al., 2020, “PubMedBERT”; Veysel and Talby, 2020) seems to show that OOVs (out-of-vocabulary tokens, writ large: whatever falls outside your word-level vocabulary) are detrimental to NER performance, at least in biomedical domains (e.g., gene and protein tagging). The latter publication uses a character CNN feeding an LSTM and reports outperforming the state of the art, including Stanza and BERT-derived models such as PubMedBERT.
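To make the OOV point concrete, here is a toy sketch (not taken from either paper, and not spaCy code). It assumes a tiny made-up vocabulary and two gene names that are not in it; the helper names `word_id` and `char_ngrams` are hypothetical. A word-level lookup collapses both unseen tokens to one shared ID, while character n-grams, used here as a rough stand-in for the input a character CNN sees, keep them distinguishable.

```python
# Toy illustration only; the vocabulary and helper names are hypothetical.
VOCAB = {"<unk>": 0, "the": 1, "gene": 2, "protein": 3}


def word_id(token):
    """Word-level lookup: every out-of-vocabulary token maps to <unk>."""
    return VOCAB.get(token.lower(), VOCAB["<unk>"])


def char_ngrams(token, n=3):
    """Character n-grams with boundary markers (a stand-in for char-CNN input)."""
    padded = "^" + token + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


oov_a, oov_b = "BRCA1", "BRCA2"

# Both tokens are unseen, so the word-level view cannot tell them apart.
same_word_id = word_id(oov_a) == word_id(oov_b)

# The character-level view shares a common prefix but still differs.
shared = char_ngrams(oov_a) & char_ngrams(oov_b)
distinct = char_ngrams(oov_a) != char_ngrams(oov_b)
```

With these inputs the two tokens get the same word ID but different character n-gram sets, which is the property the character-based features are meant to exploit.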

With that in mind, I tried to retrain just the NER component on top of the ScispaCy^^ base model (en_core_sci_md-0.3.0), and I got the error below when turning on the --chr/--use-chars flag.

The training command:

$ python -m spacy train en /path/to/output/dir /path/to/train.json /path/to/dev.json --base-model 'en_core_sci_md' --pipeline ner -R -v 'en_core_sci_md' -ne 2 --meta-path /path/to/model/en_core_sci_md/meta.json --chr

This gives the following error. (Sorry I can’t copy the whole thing; I’m visually copying and typing from screen to screen at the moment.)

...
ValueError [E149] Error deserializing model. Check that the config used to create the component matches the model being loaded.
...

This happened when the model was reloaded (presumably) from the ground up – parsing, tagging, NER, etc. – to run on the validation/dev set after the first iteration.

Note that the model trains fine, for several iterations, and saves a working model if I omit the --chr/--use-chars flag. This is doubtless because the models have tied parameters and there is no char CNN component to the Tok2Vec features for any of the other parts of the whole pipeline (tagging and parsing). I don’t want to retrain the parser and tagger to use char CNN features, so is there a workaround?
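If that explanation is right, the failure is the usual one of loading saved weights into a network built from a different config. The following is a toy sketch of that situation only: it is not spaCy's serialization code, and the layer names, shapes, and the functions `build_embed_shapes` and `load_params` are all hypothetical.

```python
# Toy illustration only; this is not spaCy's serialization code.
def build_embed_shapes(use_chars, width=96, n_chars=8, char_dim=64):
    """Return parameter shapes for a hypothetical embedding layer."""
    if use_chars:
        # Character-based embedding: features for a fixed number of characters.
        return {
            "char_embed": (n_chars, char_dim),
            "mix": (n_chars * char_dim, width),
        }
    # Word-feature embedding only (no character features).
    return {"hash_embed": (2000, width), "mix": (width, width)}


def load_params(model_shapes, saved_shapes):
    """Refuse to load saved weights whose names or shapes differ from the model."""
    if model_shapes != saved_shapes:
        raise ValueError(
            "config used to create the component does not match the saved model"
        )
    return True


saved = build_embed_shapes(use_chars=False)   # how the base model was built
rebuilt = build_embed_shapes(use_chars=True)  # what the flag requests on reload

try:
    load_params(rebuilt, saved)
    mismatch = False
except ValueError:
    mismatch = True
```

In this sketch the load only succeeds when the network is rebuilt with the same settings it was saved with, which would match the observation that training works once the flag is omitted.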

(^^Perhaps I should crosspost there, but this seems to be a spaCy issue – something about model component mismatch.)

Your Environment

Operating System: macOS High Sierra 10.13.6 (I know, out of date, but probably not relevant!)
Python Version Used: Python 3.7.9
spaCy Version Used: 2.3.2 (compatible with ScispaCy model)
Environment Information: BASH (?)

Issue Analytics

Created 3 years ago
Comments: 15 (9 by maintainers)

Top GitHub Comments

adrianeboyd commented, Nov 30, 2020

https://nightly.spacy.io/api/architectures#CharacterEmbed

https://nightly.spacy.io/usage/layers-architectures#sublayers

(Hmm, there’s really no good way to search the nightly docs, that doesn’t make things easy to find…)
