What format to use for training data and NER-model

Hello,

I have been trying to train a model using the same method as in #887, just as a test case. My question: what is the best format for a training corpus to import into spaCy? I have a text file with a list of entities that need new entity labels for tagging. To explain my case, I follow the update.training script like this:

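import random

import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

# Load the base model without its built-in entity recognizer and parser (spaCy 1.x API)
nlp = spacy.load('en_core_web_md', entity=False, parser=False)

# Create a fresh entity recognizer that only knows the new label
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])

# train_data is a list of (raw_text, entity_offsets) pairs
for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()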

I add my training data as entity_offsets:

train_data = [
    ('Monetary contracts are financial instruments between parties', [(23, 44, 'FINANCE')])
]
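
Each offset triple is (start character, end character, label), with the end index exclusive, so slicing the raw text should give back the entity string. A minimal sanity check in plain Python (no spaCy needed; the helper name `check_offsets` is made up here for illustration):

```python
def check_offsets(text, entities):
    """Return the substring that each (start, end, label) triple points at."""
    return [(text[start:end], label) for start, end, label in entities]


text = 'Monetary contracts are financial instruments between parties'
spans = check_offsets(text, [(23, 44, 'FINANCE')])
print(spans)
```

Running a check like this over a whole corpus is a cheap way to catch off-by-one offsets before training.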

This works fine for the one example and the new entity tag. Obviously, I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is: what format does spaCy need for training data? Should I stick with the entity_offsets from the examples (which would be very tedious for thousands of sentences), or is there another way to prepare the file, like:

financial instruments   FINANCE
contracts   FINANCE
Product OBJ
of O
Microsoft ORG
   etc ...

And how can I pass the corpus to spaCy using the method mentioned above? Do I have to use the newly created model, or can I add the new entities to the old model? How can this be achieved?
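
One possible way to avoid writing offsets by hand is to keep the term list from the example and compute the offsets with plain string matching. The sketch below is only an illustration under stated assumptions, not a spaCy feature: the helper names `parse_term_lines` and `find_entity_offsets` are hypothetical, each line is assumed to end with its label after whitespace, and terms labelled `O` are skipped. Matching is case-sensitive and ignores context, so every occurrence of a term gets tagged.

```python
import re


def parse_term_lines(lines):
    """Parse 'term LABEL' lines; the last whitespace-separated token is the label."""
    terms = {}
    for line in lines:
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        term, label = parts
        if label != 'O':  # 'O' marks tokens outside any entity
            terms[term] = label
    return terms


def find_entity_offsets(text, term_labels):
    """Return sorted, non-overlapping (start, end, label) triples for known terms."""
    offsets = []
    # Try longer terms first so a multi-word term wins over a shorter overlapping one
    for term, label in sorted(term_labels.items(), key=lambda kv: -len(kv[0])):
        pattern = r'\b' + re.escape(term) + r'\b'
        for match in re.finditer(pattern, text):
            start, end = match.span()
            # Keep the match only if it does not overlap an accepted span
            if all(end <= s or start >= e for s, e, _ in offsets):
                offsets.append((start, end, label))
    return sorted(offsets)


term_lines = [
    'financial instruments   FINANCE',
    'contracts   FINANCE',
    'of O',
    'Microsoft ORG',
]
sentences = ['Monetary contracts are financial instruments between parties']

terms = parse_term_lines(term_lines)
train_data = [(s, find_entity_offsets(s, terms)) for s in sentences]
print(train_data)
```

The resulting `train_data` has the same (raw_text, entity_offsets) shape as the hand-written example, so it could be fed to the training loop unchanged. Ambiguous terms would still need manual review.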

Thanks

Your Environment

spaCy version: 1.7.3
Platform: Windows-7-6.1.7601-SP1
Python version: 3.6.0
Installed models: en, en_core_web_md

Issue Analytics

Created 6 years ago
Reactions: 1
Comments: 9 (1 by maintainers)

Top GitHub Comments

2 reactions

ines commented, Apr 16, 2017

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. We’ve also updated the docs with more information on training and NER training in particular:

Workflow: Training the Named Entity Recognizer
Workflow: Saving and loading models
Example: Training an additional entity type
Command line interface for initialising, training and packaging models

I hope this helps!

1 reaction

ramonrod commented, Feb 1, 2018

Hi all, apparently there is no fully automated way to do this, at least not to my knowledge. I would recommend taking a look at the following packages (Python-compatible):