What format to use for training data and NER-model
Hello,
I have been trying to train a model using the same method as #887, just as a test case. My question is: what is the best format for a training corpus to import into spaCy? I have a text file with a list of entities that require new entity labels for tagging. To explain my case, I follow the update.training script like this:
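import random
import spacy
from spacy.pipeline import EntityRecognizer
from spacy.gold import GoldParse

# Load the model without its built-in entity recognizer and parser
nlp = spacy.load('en_core_web_md', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['FINANCE'])
for itn in range(5):
    random.shuffle(train_data)  # train_data is defined below
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)  # assign POS tags before the NER update
        ner.update(doc, gold)
ner.model.end_training()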
I add my training data as entity_offsets:
train_data = [
    ('Monetary contracts are financial instruments between parties',
     [(23, 44, 'FINANCE')])
]
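As a quick sanity check on the format above: each entry pairs the raw text with a list of (start, end, label) tuples, where start and end are character offsets and the end is exclusive, so text[start:end] should give back the annotated phrase. The sketch below is plain Python and does not call spaCy; spans_as_text is a hypothetical helper name used only for illustration.

```python
def spans_as_text(train_data):
    """Return (label, covered_text) for every annotated span (hypothetical helper)."""
    found = []
    for text, entity_offsets in train_data:
        for start, end, label in entity_offsets:
            # Offsets must lie inside the text and describe a non-empty span
            if not 0 <= start < end <= len(text):
                raise ValueError('bad span %r in %r' % ((start, end, label), text))
            found.append((label, text[start:end]))
    return found


train_data = [
    ('Monetary contracts are financial instruments between parties',
     [(23, 44, 'FINANCE')])
]
print(spans_as_text(train_data))  # [('FINANCE', 'financial instruments')]
```

Running a check like this over a larger corpus makes off-by-one mistakes in hand-written offsets easy to spot before training.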
The training works fine for this one example and the new entity tag. Obviously I want to be able to add more than one example. The idea is to create a text file with tagged sentences. The question is what format spaCy needs for training data: should I stick with the entity_offsets from the examples (this would be a very tedious task for thousands of sentences), or is there another way to prepare the file, like:
financial instruments FINANCE
contracts FINANCE
Product OBJ
of O
Microsoft ORG
etc ...
And how can I pass the corpus to spaCy using the mentioned method? Do I have to use the newly created model, or can I add the new entities to the old model? How can this be achieved?
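One possible way to avoid writing offsets by hand, sketched here under several assumptions: the phrase file has one "phrase LABEL" entry per line with the label as the last whitespace-separated field, "O" marks non-entities as in the sample above, and matching is case-sensitive on whole words. This is plain Python string matching, not a spaCy feature, and the names load_phrase_labels and find_entity_offsets are hypothetical. Because it ignores context, every occurrence of a listed phrase gets labeled, so the output would still need a review.

```python
import re


def load_phrase_labels(lines):
    """Parse 'phrase LABEL' lines into (phrase, label) pairs, skipping 'O' entries."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The label is assumed to be the last whitespace-separated field
        phrase, label = line.rsplit(None, 1)
        if label != 'O':
            pairs.append((phrase, label))
    return pairs


def find_entity_offsets(text, phrase_labels):
    """Return sorted (start, end, label) tuples for whole-word phrase matches.

    Longer phrases are matched first, and a match that overlaps an
    earlier one is skipped, so the spans never overlap.
    """
    offsets = []
    taken = []
    by_length = sorted(phrase_labels, key=lambda p: len(p[0]), reverse=True)
    for phrase, label in by_length:
        pattern = r'\b' + re.escape(phrase) + r'\b'
        for match in re.finditer(pattern, text):
            start, end = match.span()
            if any(start < e and s < end for s, e in taken):
                continue
            taken.append((start, end))
            offsets.append((start, end, label))
    return sorted(offsets)


lines = ['financial instruments FINANCE', 'contracts FINANCE', 'of O', 'Microsoft ORG']
phrase_labels = load_phrase_labels(lines)
sentences = ['Monetary contracts are financial instruments between parties']
train_data = [(s, find_entity_offsets(s, phrase_labels)) for s in sentences]
print(train_data[0][1])  # [(9, 18, 'FINANCE'), (23, 44, 'FINANCE')]
```

The resulting train_data has the same (text, entity_offsets) shape as the hand-written example, so it could be fed to the same training loop.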
Thanks
Your Environment
- spaCy version: 1.7.3
- Platform: Windows-7-6.1.7601-SP1
- Python version: 3.6.0
- Installed models: en, en_core_web_md
Top GitHub Comments
The new version 1.8.0 comes with bug fixes to the NER training procedure and a new save_to_directory() method. We’ve also updated the docs with more information on training, and NER training in particular.
I hope this helps!
Hi all, apparently there is no completely automated way to do this, at least not to my knowledge. I would recommend taking a look at the following packages (Python-compatible):
Hope this can help you.