train_new_entity_type.py is inter-mixing doc and entity_offset data with BILOU format training data

Bug in the train new entity type

I am trying to add a new entity type named technology, and I have annotated data in BILOU format to train with the train_new_entity_type.py example.

The original example is trained with the entity offset format but, as explained, it can also be trained with BILOU-format data.

My training data in BILOU format is given below:

train_data = [

    ('Play a key role in developing high - performance , scalable and fault - tolerant applications written in Java ',
    ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-technology']),

    ('Focus around building custom APIs for a variety of connected devises ',
     ['O', 'O', 'O', 'O', 'U-technology', 'O', 'O', 'O', 'O', 'O', 'O'])
]

Complete Training Script with BILOU format

from __future__ import unicode_literals, print_function
import spacy
from spacy.gold import GoldParse
import random
from pathlib import Path


def train_ner(nlp, train_data, output_dir):
    # Add new words to vocab
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]
        print(doc,len(doc))

    for itn in range(20):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            print(doc, "token length", len(doc), "length of entity", len(entity_offsets),id(doc))
            print(entity_offsets)
            # doc = nlp.make_doc(raw_text)  #position of doc object assignment is
            gold = GoldParse(doc, entities=entity_offsets) #loc is refefrenced before assignment
            doc = nlp.make_doc(raw_text)
            nlp.tagger(doc)
            loss = nlp.entity.update(doc, gold)
    nlp.end_training()
    if output_dir:
        nlp.save_to_directory(output_dir)



def main(model_name, output_directory=None):
    nlp = spacy.load(model_name)
    if output_directory is not None:
        output_directory = Path(output_directory)
    #
    train_data = [

        ('Play a key role in developing high - performance , scalable and fault - tolerant applications written in Java ',
        ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-technology']),

        ('Focus around building custom APIs for a variety of connected devises ',
         ['O', 'O', 'O', 'O', 'U-technology', 'O', 'O', 'O', 'O', 'O', 'O'])
    ]
    nlp.entity.add_label('technology')
    ner = train_ner(nlp, train_data, output_directory)

if __name__ == '__main__':

    output_directory = "example_model"
    main('en',output_directory)

The above example throws the following error:

gold = GoldParse(doc, entities=entity_offsets) #loc is refefrenced before assignment
File "spacy/gold.pyx", line 294, in spacy.gold.GoldParse.__init__ (spacy/gold.cpp:10834)
IndexError: list index out of range

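The issue can be fixed by moving 'doc = nlp.make_doc(raw_text)' so that it runs before the GoldParse object is created. Currently it runs after GoldParse, so GoldParse receives the doc left over from the previous iteration (or from the vocab loop), whose token count may not match the length of the BILOU tag list. PyCharm also flags this as 'Local variable might be referenced before assignment'.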
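The ordering problem can be reproduced without spaCy. The sketch below is only an illustration: `make_doc` is a hypothetical stand-in that splits on whitespace (not spaCy's tokenizer), and `token_tag_lengths` is a made-up helper that records the token count and tag count that would be handed to GoldParse in each iteration.

```python
def make_doc(text):
    # Hypothetical stand-in for nlp.make_doc: plain whitespace tokenization.
    return text.split()


def token_tag_lengths(train_data, create_doc_first):
    """Return (n_tokens, n_tags) pairs as they would reach GoldParse."""
    # After the vocab loop, `doc` still holds the last training example.
    doc = make_doc(train_data[-1][0])
    seen = []
    for raw_text, tags in train_data:
        if create_doc_first:
            doc = make_doc(raw_text)      # fixed order: doc matches raw_text
        seen.append((len(doc), len(tags)))
        if not create_doc_first:
            doc = make_doc(raw_text)      # buggy order: doc is one step stale
    return seen


train_data = [
    ('Play a key role in developing high - performance , scalable and '
     'fault - tolerant applications written in Java ',
     ['O'] * 18 + ['U-technology']),
    ('Focus around building custom APIs for a variety of connected devises ',
     ['O'] * 4 + ['U-technology'] + ['O'] * 6),
]

buggy = token_tag_lengths(train_data, create_doc_first=False)
fixed = token_tag_lengths(train_data, create_doc_first=True)
print("buggy order:", buggy)
print("fixed order:", fixed)
```

In the buggy order, the first iteration pairs an 11-token doc with 19 tags, which is the kind of mismatch that can surface as an IndexError. With offset-format data the same stale doc would not necessarily raise, which may be why the second script appears to work.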


Complete Training Script with entity offset format

The example below works fine when the training data is in entity offset format.

from __future__ import unicode_literals, print_function
import spacy
from spacy.gold import GoldParse
import random
from pathlib import Path


def train_ner(nlp, train_data, output_dir):
    # Add new words to vocab
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]
        print(doc,len(doc))

    for itn in range(20):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            print(doc, "token length", len(doc), "length of entity", len(entity_offsets),id(doc))
            print(entity_offsets)
            # doc = nlp.make_doc(raw_text)  #position of doc object assignment is
            gold = GoldParse(doc, entities=entity_offsets) #loc is refefrenced before assignment
            doc = nlp.make_doc(raw_text)
            nlp.tagger(doc)
            loss = nlp.entity.update(doc, gold)
    nlp.end_training()
    if output_dir:
        nlp.save_to_directory(output_dir)



def main(model_name, output_directory=None):
    nlp = spacy.load(model_name)
    if output_directory is not None:
        output_directory = Path(output_directory)

    train_data = [
         (
         'Play a key role in developing high - performance , scalable and fault - tolerant applications written in Java ',
         [(105,109,'technology')]),
    
         ('Focus around building custom APIs for a variety of connected devises ',
          [(29,33,'technology')])
     ]

    nlp.entity.add_label('technology')
    ner = train_ner(nlp, train_data, output_directory)


if __name__ == '__main__':

    output_directory = "example_model"
    main('en',output_directory)
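The two training sets above describe the same annotations in different forms. As a rough sketch of how they relate, the hypothetical helper `offsets_to_bilou` below converts character offsets into one BILOU tag per token. It assumes whitespace tokenization, which happens to fit this pre-tokenized data but is not spaCy's tokenizer.

```python
def offsets_to_bilou(text, offsets):
    """Convert (start, end, label) character offsets to BILOU tags.

    Assumes tokens are separated by whitespace (an assumption made for
    this illustration only).
    """
    # Record the character span of every whitespace-delimited token.
    spans = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end

    tags = ['O'] * len(spans)
    for ent_start, ent_end, label in offsets:
        covered = [i for i, (s, e) in enumerate(spans)
                   if s >= ent_start and e <= ent_end]
        if not covered:
            continue  # offsets do not align with token boundaries
        if len(covered) == 1:
            tags[covered[0]] = 'U-' + label
        else:
            tags[covered[0]] = 'B-' + label
            tags[covered[-1]] = 'L-' + label
            for i in covered[1:-1]:
                tags[i] = 'I-' + label
    return tags


text = 'Focus around building custom APIs for a variety of connected devises '
tags = offsets_to_bilou(text, [(29, 33, 'technology')])
print(tags)
```

Checking that `len(tags)` equals the number of tokens in the doc passed to GoldParse is a cheap sanity check before training.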

Your Environment

  • Operating System: Ubuntu 14.04
  • Python Version Used: 3.5.2
  • spaCy Version Used: 1.8.0
  • Environment Information: Pycharm

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
honnibal commented, Apr 23, 2017

@irfan-zoefit I think the mistake is that

gold = GoldParse(doc, offsets)

Should be

gold = GoldParse(doc, entities=offsets)
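
The difference between the two calls comes down to how Python binds positional and keyword arguments. The sketch below uses a hypothetical stand-in function, `make_gold`, whose parameter list is an assumption chosen for illustration and is not spaCy's actual GoldParse signature. It only shows that a second positional argument lands in whatever parameter comes second, not in `entities`.

```python
def make_gold(doc, annot_tuples=None, entities=None):
    # Hypothetical stand-in, NOT spaCy's GoldParse: it just reports
    # which parameter each argument was bound to.
    return {'annot_tuples': annot_tuples, 'entities': entities}


offsets = [(29, 33, 'technology')]

positional = make_gold('doc', offsets)          # binds to annot_tuples
keyword = make_gold('doc', entities=offsets)    # binds to entities

print(positional)
print(keyword)
```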