SpaCy NER training example from version 1.5.0 doesn't work in 1.6.0


I tried to use the training example here:

https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py

with SpaCy 1.6.0. I get results like this:

Who is Shaka Khan?
Who 1228 554 WP  2
is 474 474 VBZ PERSON 3
Shaka 57550 129921 NNP PERSON 1
Khan 12535 48600 NNP LOC 3
? 482 482 . LOC 3

I like London and Berlin
I 467 570 PRP LOC 3
like 502 502 VBP LOC 1
London 4003 24340 NNP LOC 3
and 470 470 CC PERSON 3
Berlin 11964 60816 NNP PERSON 1

The tagging is odd: Khan is recognized as a LOC and Berlin as a PERSON. If I go back to version 1.5.0, the result is as expected:

Who is Shaka Khan?
Who 1228 554 WP  2
is 474 474 VBZ  2
Shaka 57550 129921 NNP PERSON 3
Khan 12535 48600 NNP PERSON 1
? 482 482 .  2

I like London and Berlin
I 467 570 PRP  2
like 502 502 VBP  2
London 4003 24340 NNP LOC 3
and 470 470 CC  2
Berlin 11964 60816 NNP LOC 3

Could this be an issue with the off-the-shelf English model that spacy.en.download fetched for 1.6.0?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

6 reactions
honnibal commented on Jan 27, 2017

TL;DR

I made a bug fix to thinc for 1.6 that messed up the example as it’s written.

The best fix is to not call .end_training() after updating the model. I’m working on making this less confusing.
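
As a concrete illustration, here is a minimal sketch of that workaround, assuming the spaCy 1.x API from the linked train_ner.py example and the training docs (spacy.load, GoldParse, nlp.entity.update); the toy data is the two sentences from this issue:

import random
import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
train_data = [
    (u'Who is Shaka Khan?', [(7, 17, u'PERSON')]),
    (u'I like London and Berlin.', [(7, 13, u'LOC'), (18, 24, u'LOC')]),
]
for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        nlp.tagger(doc)  # tag first; the entity recognizer uses POS features
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.entity.update(doc, gold)
# Workaround: do NOT call nlp.entity.model.end_training() here when resuming
# from the pre-trained model, so the previously learned weights are kept.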

What’s going on

spaCy 1.x uses the Averaged Perceptron algorithm for all its machine learning. You can read about the algorithm in the POS tagger blog post, where you can also find a straightforward Python implementation: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python

AP uses the Averaged Parameter Trick for SGD. There are two copies of the weights:

  1. The current weights,
  2. The averaged weights

During training predictions are made with the current weights, and the averaged weights are updated in the background. At the end of training, we swap the current for the averages. This makes a huge difference for most training scenarios.
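
To make the bookkeeping concrete, here is a hypothetical toy version of the averaging trick in plain Python. It is not thinc’s actual code, just an illustration of the idea described above:

from collections import defaultdict

class AveragedWeights(object):
    def __init__(self):
        self.current = defaultdict(float)    # weights used for predictions
        self.totals = defaultdict(float)     # running sum of weight * time
        self.last_update = defaultdict(int)  # timestep each weight last changed
        self.i = 0                           # number of updates so far
    def update(self, feature, delta):
        # Credit the old value for the time it was in effect, then change it.
        self.totals[feature] += (self.i - self.last_update[feature]) * self.current[feature]
        self.last_update[feature] = self.i
        self.current[feature] += delta
        self.i += 1
    def end_training(self):
        # Swap in the averages; conceptually this is what .end_training() does.
        for feature, weight in self.current.items():
            self.totals[feature] += (self.i - self.last_update[feature]) * weight
            self.current[feature] = self.totals[feature] / max(self.i, 1)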

However, when I wrote the code, I didn’t pay much attention to the current use-case of “resuming” training, in order to add another class. I recently fixed a long-standing error in the averaged perceptron code:

After loading a model, Thinc was not initialising the averages to the newly loaded weights. Skipping that initialisation saves memory, because the averages require another copy of the weights, plus some additional book-keeping. The consequence of this bug was that when you updated a feature after resuming training, you wiped the weights previously associated with that feature. This is really bad: it means that as you train on new examples, you delete the information the model had previously learned for those features.
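
In terms of the toy sketch above, the fix can be pictured as seeding the averaging book-keeping from the loaded weights when training resumes, so the old weights keep contributing to the average. This is illustration only, not the actual change in the thinc commit linked below:

def resume_training(model, saved_weights, prior_updates):
    # Illustration only: credit each loaded weight from t=0, as if it had
    # been in effect for all the prior updates, so averaging does not erase it.
    model.i = prior_updates
    for feature, weight in saved_weights.items():
        model.current[feature] = weight
        model.totals[feature] = 0.0
        model.last_update[feature] = 0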

I finally fixed this bug in this commit: https://github.com/explosion/thinc/commit/09b030b4aa0e58fd3eef0eda5340795fd079b248

The consequence is that the correction makes the model behave differently on these small-data example cases.

What’s still unclear is how we should compute an average between the old weights and the new ones. The old weights were trained with about 20 passes over about 80,000 sentences of annotation, so 5 new passes over 5 examples shouldn’t change the weights at all if we take an unbiased average. This seems undesirable.
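
Rough back-of-the-envelope arithmetic for that point, counting sentence visits as a stand-in for individual weight updates:

old_updates = 20 * 80000   # ~20 passes over ~80,000 annotated sentences
new_updates = 5 * 5        # 5 passes over 5 examples
share = float(new_updates) / (old_updates + new_updates)
print('share of an unbiased average coming from the new data: {:.4%}'.format(share))
# roughly 0.0016%, so the new examples barely move the averaged weights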

If you have so little data, it’s probably not a good idea to average.

About NER and training more generally (making this the megathread)

#762, #612, #701, #665. Attn: @savvopoulos, @viksit

People are having a lot of pain with training the NER system. Some of the problems are easy to fix — the current workflow around saving and loading data is pretty bad, and it’s made worse by some Python 2/3 unicode save/load bugs in the example scripts.

What’s hard to solve is that people seem to want to train the NER system on like, 5 examples. The current algorithm expects more like 5,000. I realise I never wrote this anywhere, and the examples all show five examples. I guess I’ve been doing this stuff too long, and it’s no longer obvious to me what is and isn’t obvious. I think this has been the root cause of a lot of confusion.

Things will improve with spaCy 2.0 a little bit. You might be able to get a useful model with as little as 500 or 1,000 sentences annotated with a new NER class. Maybe.

We’re working on ways to make all of this more efficient. We’re working on making annotation projects less expensive and more consistent, and we’re working on algorithms that require fewer annotated examples. But there will always be limits.

The thing is… I think most teams should be annotating literally 10,000x as much data as they’re currently trying to get away with. You should have at least 1,000 sentences of evaluation data that your machine learning model never sees. Otherwise how will you know that your system is working? By typing stuff into it, manually? You wouldn’t test your other code like that, would you? 😃
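
For what it’s worth, a minimal sketch of that held-out evaluation idea; the data format (text plus character-offset entities) and the load_annotations helper are hypothetical:

def evaluate_entities(nlp, eval_data):
    # eval_data: list of (text, [(start_char, end_char, label), ...]) pairs
    tp = fp = fn = 0
    for text, gold_entities in eval_data:
        doc = nlp(text)
        predicted = set((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
        gold = set(gold_entities)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall
# eval_data = load_annotations('eval_sentences.json')  # held out, never trained on
# print(evaluate_entities(nlp, eval_data))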

2 reactions
badbye commented on Mar 8, 2017

@honnibal Thanks for your explanation.

Currently, the example code for training and updating NER in the documentation only uses 2 sentences, which is obviously not enough (I realized this after reading your comment).

I think it would be better if you put your explanation in the documentation. Everyone reads the docs first to learn something; they go to the issues only if they can’t find what they want in the docs.

More problems about the example code

  1. How do you use the updated NER model? Update: found an example here: https://spacy.io/docs/usage/training#train-entity (see also the sketch after the snippet below)

  2. It seems the example is retraining a NER model rather than updating the original one?

>>> # after running the example code, it does not work
>>> nlp(u'Who is Chaka Khan?').ents
()
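
For reference, a hedged sketch (assuming the spaCy 1.x API from the linked training docs) of applying the updated, in-memory entity recognizer to new text; whether anything is actually detected still depends on the data-size issues discussed above:

doc = nlp.make_doc(u'Who is Chaka Khan?')
nlp.tagger(doc)   # tag first; the entity recognizer uses POS features
nlp.entity(doc)   # run the (updated) entity recognizer on the doc
print([(ent.text, ent.label_) for ent in doc.ents])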