train_ner.py train data format to spaCy's json

Hi, I’m trying to use the CLI train command to train a NER model. I was able to train one by following the example in train_ner.py, where the data needs to be formatted like this:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
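
For readers unfamiliar with this format: each entity is a (start, end, label) triple of character offsets into the text, with the end offset exclusive. The short check below is not part of the original post; it is a minimal sketch, and the helper name entity_spans is made up for illustration. It slices each span back out of its text to show what the offsets refer to.

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_spans(train_data):
    """Return (surface text, label) pairs sliced out by the character offsets.

    Hypothetical helper for illustration; it is not a spaCy function.
    """
    return [
        (text[start:end], label)
        for text, annotations in train_data
        for start, end, label in annotations["entities"]
    ]


print(entity_spans(TRAIN_DATA))
# [('Shaka Khan', 'PERSON'), ('London', 'LOC'), ('Berlin', 'LOC')]
```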

I now want to use the more powerful CLI.train command, but all my data is in the format above. Is there an existing script for this conversion? As far as I can see, this isn’t supported by CLI.convert.

Thanks.

Your Environment

  • spaCy version: 2.2.4
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • Models: en, es

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
svlandeg commented, Aug 20, 2020

> Wouldn’t it make sense to add it into the CLI convert script as another supported format?

You’re right that this has been lacking. For spaCy v3, we’re working on an overhaul of the convert function and the data formats in general, which should hopefully make all of this more intuitive!

1 reaction
adrianeboyd commented, Jun 19, 2020

Here’s my stackoverflow answer on how to do this: https://stackoverflow.com/a/59209377/461847

It would probably make sense to add an example script to do this, since this is the main missing step for people who want to move from the super simple example training scripts to real training with the train CLI.
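
To give a rough idea of what such a conversion involves, here is a minimal, standard-library-only sketch. It is not the script from the linked answer and not part of spaCy. It assumes a crude regex tokenizer (TOKEN_RE) in place of spaCy's own, and the helper names biluo_tags and to_json_docs are made up for illustration. The output follows the nested id / paragraphs / sentences / tokens layout of the v2 JSON training format, filling in only the fields relevant to NER. Whether the train CLI accepts exactly this reduced output is an assumption, so for real data the approach in the linked answer, which relies on spaCy's own tokenizer and helpers, is the safer route.

```python
import json
import re

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# Assumption: a crude stand-in for spaCy's tokenizer (words and single punctuation marks).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def biluo_tags(text, entities):
    """Tokenize text and derive one BILUO tag per token from character offsets."""
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of tokens lying fully inside the entity span.
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        # Refuse spans that do not start and end exactly on token boundaries.
        if not inside or tokens[inside[0]][1] != start or tokens[inside[-1]][2] != end:
            raise ValueError(f"entity ({start}, {end}, {label!r}) is not aligned with tokens")
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"  # single-token entity
        else:
            tags[inside[0]] = f"B-{label}"  # first token
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"  # middle tokens
            tags[inside[-1]] = f"L-{label}"  # last token
    return [orth for orth, _, _ in tokens], tags


def to_json_docs(train_data):
    """Build one JSON document per example, each with a single sentence."""
    docs = []
    for doc_id, (text, annotations) in enumerate(train_data):
        orths, tags = biluo_tags(text, annotations["entities"])
        tokens = [
            {"id": i, "orth": orth, "ner": tag}
            for i, (orth, tag) in enumerate(zip(orths, tags))
        ]
        docs.append(
            {"id": doc_id, "paragraphs": [{"raw": text, "sentences": [{"tokens": tokens}]}]}
        )
    return docs


print(json.dumps(to_json_docs(TRAIN_DATA)[0], indent=2))
```

The alignment check matters in practice: an entity whose offsets cut through the middle of a token cannot be expressed as per-token tags, so the sketch raises an error instead of silently dropping or mislabeling it.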
