
Limit on the number of NER tags in a sentence

See original GitHub issue

Is there a limit on the number of NER tags a sentence can have while training a new NER model with spaCy, for example at most 14 tags per sentence? I gave the following input to the main function:

main([('Theprovincialgovernormuststillsignthebillbeforeitbecomeslaw,astepseenonlyasaformality',
 {'entities': [(0, 2, 'O'),
   (3, 12, 'O'),
   (13, 20, 'O'),
   (21, 24, 'O'),
   (25, 29, 'O'),
   (30, 33, 'O'),
   (34, 36, 'O'),
   (37, 40, 'O'),
   (41, 46, 'O'),
   (47, 48, 'O'),
   (49, 55, 'O'),
   (56, 58, 'O'),
   (59, 59, 'O'),
   (60, 60, 'O')]})])

It works as long as the number of entity tags is less than or equal to 14, but as soon as I add one more tag I get this error:


Created blank 'en' model
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-134-04720f00e29f> in <module>()
     14    (59, 59, 'O'),
     15    (60, 60, 'O'),
---> 16    (65, 68, 'O')]})])

1 frames
/usr/local/lib/python3.6/dist-packages/spacy/language.py in update(self, docs, golds, drop, sgd, losses, component_cfg)
    517             kwargs = component_cfg.get(name, {})
    518             kwargs.setdefault("drop", drop)
--> 519             proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
    520             for key, (W, dW) in grads.items():
    521                 sgd(W, dW, key=key)

nn_parser.pyx in spacy.syntax.nn_parser.Parser.update()

nn_parser.pyx in spacy.syntax.nn_parser.Parser._init_gold_batch()

transition_system.pyx in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence()

transition_system.pyx in spacy.syntax.transition_system.TransitionSystem.set_costs()

ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the experimental `debug-data` command to validate your JSON-formatted training data. For details, run:
python -m spacy debug-data --help

I am using the following main function:

import random
import warnings
from pathlib import Path

import spacy


def main(train_data, test_data=None, model=None, output_dir=None, n_iter=1):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # begin_training resets and randomly initializes the weights
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            # batch up the examples using spaCy's minibatch
            #batches = minibatch(train_data)
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    '''
    # test the trained model
    for text, _ in test_data:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
    '''
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in train_data:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.7.7
  • spaCy Version Used: 2.2.4
  • Environment Information: Jupyter Notebook

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
adrianeboyd commented, May 25, 2020

Maybe an example would be better:

text = "Jane was here."

# option 1: tokens + BILUO tags
words = ["Jane", "was", "here", "."]
entities = ["U-PERSON", "O", "O", "O"]
TRAIN_DATA = [("Jane was here.", {"words": words, "entities": entities})]

# option 2: character offsets for entities
entities = [(0, 4, "PERSON")]
assert text[0:4] == "Jane"
assert len("Jane") == 4 - 0
TRAIN_DATA = [("Jane was here.", {"entities": entities})]

The annotation in the second half of the training tuple corresponds to the arguments to GoldParse: https://spacy.io/api/goldparse#init
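
For a concrete illustration, here is a minimal sketch (spaCy v2.x) of that correspondence, constructing a GoldParse directly from the same annotations; gold.ner then holds the per-token BILUO tags:

import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp.make_doc("Jane was here.")

# the keys of the annotation dict map onto GoldParse keyword arguments
gold = GoldParse(doc, entities=[(0, 4, "PERSON")])
print(gold.ner)  # per-token BILUO tags: ['U-PERSON', 'O', 'O', 'O']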

And a correction to what I said above: I said IOB tags, when it should have been BILUO tags.

1 reaction
adrianeboyd commented, May 25, 2020

Hmm, I’m having a bit of trouble understanding your example data above. What entities are you labeling? Why is the sentence formatted without whitespace? The default English tokenizer isn’t going to be able to split this sentence into tokens so you can’t easily train a model from this kind of text for English.
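
A quick way to see the misalignment concretely is spaCy v2's biluo_tags_from_offsets helper, which maps character offsets onto a doc's tokens and returns "-" for spans that don't line up with token boundaries. A minimal sketch, where the "ENT" label is just illustrative:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp.make_doc("Theprovincialgovernor")  # no whitespace: a single token
print([t.text for t in doc])  # ['Theprovincialgovernor']

tags = biluo_tags_from_offsets(doc, [(0, 3, "ENT")])
print(tags)  # ['-']  misaligned span, which is what triggers E024 in training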

If you want to provide BILUO labels for each token instead of providing entity spans as character offsets, you should use a list of BILUO tags like ["B-PER", "L-PER", "O", "O", ..., "U-LOC"] instead of the (start, end, label) tuples. In this case, due to potential tokenization differences you should also provide a list of words as words: ["The", "words", "are", ..., "."] along with the entities in TRAIN_DATA.

If you use character spans, the span end character is exclusive (end-start should be the length of the span text). If you use character spans, you just label the entities and don’t label any O spans.
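
A minimal sketch of both rules, where the GPE labels are illustrative:

text = "Berlin is in Germany."
entities = [(0, 6, "GPE"), (13, 20, "GPE")]  # label only real entities, no "O" spans

for start, end, label in entities:
    span = text[start:end]           # the end offset is exclusive
    assert len(span) == end - start
    print(span, label)               # Berlin GPE / Germany GPE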

I think you may have started working from a similar example, but in case you haven’t seen the full script, here’s an example with English data: https://github.com/explosion/spaCy/blob/24ef6680fa0e52656f6f3108d0f2ddfeec142e7e/examples/training/train_ner.py

Edit: BILUO instead of IOB.
