What format to use for training data and NER-model

Hello,

I have been trying to train a model using the same method as in #887, just as a test case. My question is: what is the best format for a training corpus to import into spaCy? I have a text file with a list of entities that need new entity labels for tagging. To explain my case, I follow the update.training script like this:

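import random

import spacy
from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

# spaCy 1.x API: load the model without its built-in NER and parser
nlp = spacy.load('en_core_web_md', entity=False, parser=False)

# Fresh entity recognizer that only knows the new FINANCE label
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)

        nlp.tagger(doc)
        ner.update(doc, gold)
ner.model.end_training()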

I add my training data as entity_offsets:

train_data = [
    ('Monetary contracts are financial instruments between parties', [(23, 44, 'FINANCE')])
]
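
For context, each entity offset is a (start, end, label) triple of character positions in the raw text, with the end position exclusive. A minimal sketch in plain Python (no spaCy calls; the variable names are only illustrative) that checks the offsets above and derives them with str.find instead of counting by hand:

```python
text = 'Monetary contracts are financial instruments between parties'
start, end, label = 23, 44, 'FINANCE'

# Slicing with the offsets should give back exactly the entity text.
entity_text = text[start:end]
print(entity_text)

# The same offsets can be located instead of counted manually.
phrase = 'financial instruments'
found_start = text.find(phrase)  # -1 if the phrase does not occur
found = (found_start, found_start + len(phrase), label)
print(found)
```

If the slice does not print the intended phrase, the offsets are off and the example would be misaligned with the tokens during training.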

This works fine for the one example and the new entity tag. Obviously, I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is: what format does spaCy need for training data? Should I stick with the entity_offsets from the examples (which would be very tedious for thousands of sentences), or is there another way to prepare the file, like:

financial instruments   FINANCE
contracts   FINANCE
Product OBJ
of O
Microsoft ORG
   etc ...
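
A term list like this could, in principle, be converted into entity offsets automatically by string matching rather than by hand. The following is a rough sketch in plain Python, not part of spaCy: load_term_labels and find_entity_offsets are hypothetical helper names, and it assumes each line is a term followed by whitespace and a label, with O meaning "no entity". Exact matching like this misses inflected forms and cannot resolve ambiguous terms, so the output would still need review.

```python
import re


def load_term_labels(lines):
    """Parse 'term<whitespace>LABEL' lines into (term, label) pairs."""
    pairs = []
    for line in lines:
        parts = line.strip().rsplit(None, 1)  # split off the last token
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        term, label = parts
        if label != 'O':  # assumption: 'O' marks non-entities
            pairs.append((term, label))
    return pairs


def find_entity_offsets(text, term_labels):
    """Return sorted, non-overlapping (start, end, label) character offsets."""
    offsets = []
    # Try longer terms first so multi-word terms win over their substrings.
    for term, label in sorted(term_labels, key=lambda p: len(p[0]), reverse=True):
        pattern = r'\b' + re.escape(term) + r'\b'
        for match in re.finditer(pattern, text):
            start, end = match.start(), match.end()
            # Keep the match only if it does not overlap an accepted span.
            if all(end <= s or start >= e for s, e, _ in offsets):
                offsets.append((start, end, label))
    return sorted(offsets)


term_lines = [
    'financial instruments   FINANCE',
    'contracts   FINANCE',
    'of O',
    'Microsoft ORG',
]
sentences = ['Monetary contracts are financial instruments between parties']

term_labels = load_term_labels(term_lines)
train_data = [(s, find_entity_offsets(s, term_labels)) for s in sentences]
print(train_data)
```

The resulting train_data has the same (text, entity_offsets) shape as the hand-written example, so it could be fed to the same training loop.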

And how can I pass the corpus to spaCy using the method mentioned above? Do I have to use the newly created model, or can I add the new entities to the existing model? How can this be achieved?

Thanks

Your Environment

  • spaCy version: 1.7.3
  • Platform: Windows-7-6.1.7601-SP1
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

2 reactions
ines commented, Apr 16, 2017

The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. We’ve also updated the docs with more information on training and NER training in particular:

I hope this helps!

1 reaction
ramonrod commented, Feb 1, 2018

Hi all, apparently there is no fully automated way to do this, at least not to my knowledge. I would recommend taking a look at the following packages (Python-compatible):

Hope this can help you.


