Limit on the number of NER tags in a sentence
Is there any limit, for example that a sentence can have at most 14 NER tags, while training a new model for NER using spaCy? I gave the following input to the main function:
main([('Theprovincialgovernormuststillsignthebillbeforeitbecomeslaw,astepseenonlyasaformality',
{'entities': [(0, 2, 'O'),
(3, 12, 'O'),
(13, 20, 'O'),
(21, 24, 'O'),
(25, 29, 'O'),
(30, 33, 'O'),
(34, 36, 'O'),
(37, 40, 'O'),
(41, 46, 'O'),
(47, 48, 'O'),
(49, 55, 'O'),
(56, 58, 'O'),
(59, 59, 'O'),
(60, 60, 'O')]})])
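As an aside, spaCy treats the end offset of an entity span as exclusive, so `text[start:end]` should slice out exactly the intended span text. A quick sketch of that check against the text above (pure Python, no spaCy needed) shows the offsets were written as if the end were inclusive:

```python
text = ("Theprovincialgovernormuststillsignthebill"
        "beforeitbecomeslaw,astepseenonlyasaformality")

# spaCy entity offsets are (start, end) with an exclusive end, so
# text[start:end] should be exactly the span text. Offsets written as if
# the end were inclusive come out one character short:
print(repr(text[0:2]))   # 'Th'  -- one short of the intended token
print(repr(text[0:3]))   # 'The' -- correct with an exclusive end
```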
It works while the number of entity tags is less than or equal to 14; as soon as I add one more tag, I get this error:
Created blank 'en' model
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-134-04720f00e29f> in <module>()
14 (59, 59, 'O'),
15 (60, 60, 'O'),
---> 16 (65, 68, 'O')]})])
1 frames
/usr/local/lib/python3.6/dist-packages/spacy/language.py in update(self, docs, golds, drop, sgd, losses, component_cfg)
517 kwargs = component_cfg.get(name, {})
518 kwargs.setdefault("drop", drop)
--> 519 proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
520 for key, (W, dW) in grads.items():
521 sgd(W, dW, key=key)
nn_parser.pyx in spacy.syntax.nn_parser.Parser.update()
nn_parser.pyx in spacy.syntax.nn_parser.Parser._init_gold_batch()
transition_system.pyx in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence()
transition_system.pyx in spacy.syntax.transition_system.TransitionSystem.set_costs()
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the experimental `debug-data` command to validate your JSON-formatted training data. For details, run:
python -m spacy debug-data --help
I am using the following as the main function:
import random
import warnings
from pathlib import Path

import spacy


def main(train_data, test_data=None, model=None, output_dir=None, n_iter=1):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER; join the context managers with a comma, not `and`
    # (with `and`, only the second context manager would actually be entered)
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module="spacy")
        # reset and initialize the weights randomly -- but only if we're
        # training a new model
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            # batch up the examples using spaCy's minibatch
            # batches = minibatch(train_data)
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses,
                )
            print(losses)
    '''
    # test the trained model
    for text, _ in test_data:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
    '''
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in train_data:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
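For what it's worth, a well-formed training example for this function could look like the sketch below (the sentence and the GPE span are made up for illustration, not from the issue): whitespace-separated text, exclusive end offsets, and spans only for real entities, since untagged tokens are implicitly O:

```python
# Hypothetical well-formed example: normal whitespace, exclusive end
# offsets, and spans only for actual entities (no "O" spans needed).
TRAIN_DATA = [
    ("The provincial governor lives in Ontario.",
     {"entities": [(33, 40, "GPE")]}),
]

# Sanity-check every span before training: the slice must match the
# entity text with no leading/trailing whitespace or punctuation.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        span = text[start:end]
        assert span == span.strip(" .,!"), f"bad span: {span!r}"
        print(label, repr(span))  # GPE 'Ontario'
```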
Your Environment
- Operating System: Windows 10
- Python Version Used: 3.7.7
- spaCy Version Used: 2.2.4
- Environment Information: Jupyter Notebook
Issue Analytics
- Created: 3 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Maybe an example would be better: the annotation in the second half of the training tuple corresponds to the arguments to GoldParse: https://spacy.io/api/goldparse#init

And a correction to what I said above: I said IOB tags, when it should have been BILUO tags.
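A sketch of what that BILUO-tagged alternative could look like (the sentence and labels here are made up for illustration): one tag per token, with the tokenization supplied as `words` alongside `entities`:

```python
# BILUO tags, one per token, instead of (start, end, label) spans.
# B- = beginning, I- = inside, L- = last, U- = single-token unit, O = outside.
words = ["The", "provincial", "governor", "lives", "in", "Ontario", "."]
tags = ["O", "O", "O", "O", "O", "U-GPE", "O"]
assert len(words) == len(tags)  # exactly one tag per token

# Supplying `words` alongside `entities` sidesteps tokenization mismatches.
TRAIN_DATA = [(" ".join(words), {"words": words, "entities": tags})]
```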
Hmm, I’m having a bit of trouble understanding your example data above. What entities are you labeling? Why is the sentence formatted without whitespace? The default English tokenizer isn’t going to be able to split this sentence into tokens so you can’t easily train a model from this kind of text for English.
If you want to provide BILUO labels for each token instead of providing entity spans as character offsets, you should use a list of BILUO tags like ["B-PER", "L-PER", "O", "O", ..., "U-LOC"] instead of the (start, end, label) tuples. In this case, due to potential tokenization differences, you should also provide a list of words as words: ["The", "words", "are", ..., "."] along with the entities in TRAIN_DATA.

If you use character spans, the span end character is exclusive (end - start should be the length of the span text). If you use character spans, you just label the entities and don't label any O spans.

I think you may have started working from a similar example, but in case you haven't seen the full script, here's an example with English data: https://github.com/explosion/spaCy/blob/24ef6680fa0e52656f6f3108d0f2ddfeec142e7e/examples/training/train_ner.py

Edit: BILUO instead of IOB.
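One more self-contained check worth running on character-span data: spans that start or end in the middle of a token tend to trigger alignment errors like E024. A rough whitespace-only approximation of that check (spaCy's real tokenizer splits more finely, so treat this only as a first pass):

```python
def span_on_token_boundaries(text, start, end):
    """Rough whitespace-only check that a character span begins and ends
    on token boundaries; misaligned spans are a common cause of E024."""
    left_ok = start == 0 or text[start - 1] == " "
    right_ok = end == len(text) or text[end] in " .,;"
    return left_ok and right_ok

text = "The provincial governor lives in Ontario"
print(span_on_token_boundaries(text, 33, 40))  # True: "Ontario"
print(span_on_token_boundaries(text, 33, 38))  # False: ends mid-token
```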