Entity extracted at evaluation doesn't show up using the imported model

Hi,

I trained a custom NER model on our own labelled dataset on financial risk. I did pretraining as well, and the model finished with a score of 97.48.

Training pipeline: ner
Starting with blank model 'en'
2511 training docs
267 evaluation docs

============================== Vocab & Vectors ==============================
ℹ 101601 total words in the data (12951 unique)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 1 new label, 0 existing labels
0 missing values (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

When evaluating the model on the dev set, entities got picked up just fine. However, there is one entity, US Treasury Department’s Office of Foreign Assets Control (at least, it's the one I’ve noticed), that doesn’t show up in the same sentences when importing and testing the best-model in a notebook:

[Two screenshots, not reproduced here: "Screen Shot 2020-08-24 at 08 26 29" and "Screen Shot 2020-08-24 at 08 29 03"]

Then I ran a test on every single sentence (150) containing the missing entity. Only 3 returned it, and only partially, as: Department’s Office of Foreign Assets Control.
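A per-sentence test like the one described here could be sketched as follows. This is a minimal illustration, not the code used in the original report: `classify_hit` and `audit` are hypothetical helper names, the "partial" rule (a predicted span that is a substring of the target) is a simplifying assumption, and `nlp` is assumed to be any callable that returns an object with an `.ents` sequence whose items have a `.text` attribute, as a loaded spaCy pipeline does.

```python
from collections import Counter
from typing import Callable, Iterable

# The entity string reported as missing in the issue.
TARGET = "US Treasury Department’s Office of Foreign Assets Control"


def classify_hit(predicted_spans: Iterable[str], target: str) -> str:
    """Label one sentence as 'exact', 'partial', or 'missing'."""
    spans = list(predicted_spans)
    if target in spans:
        return "exact"
    # Assumption: a partial hit is a non-empty predicted span that is a
    # substring of the target, e.g. "Department’s Office of Foreign Assets Control".
    if any(span and span in target for span in spans):
        return "partial"
    return "missing"


def audit(nlp: Callable, sentences: Iterable[str], target: str = TARGET) -> Counter:
    """Run the pipeline on each sentence and count the outcome per category."""
    counts: Counter = Counter()
    for sentence in sentences:
        doc = nlp(sentence)
        counts[classify_hit((ent.text for ent in doc.ents), target)] += 1
    return counts


# Usage with a real model (the path is a placeholder, not from the issue):
#   import spacy
#   nlp = spacy.load("path/to/best-model")
#   print(audit(nlp, sentences_containing_target))
```

Counting the outcomes per category makes it easy to compare the same sentences between the evaluation run and the notebook run.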

There are quite a few other entities, like Department of Justice (130), Department of State (70), and US Department of the Treasury (40), which contain similar wording. Could these potentially conflict with the missing entity, US Treasury Department’s Office of Foreign Assets Control? However, this still wouldn’t explain why the entity is present in the evaluation sample but missing in production.
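When the "same" sentence gives different results in evaluation and in a notebook, one thing that may be worth ruling out is that the two strings are not actually identical, for example a typographic apostrophe (’) in one and a straight apostrophe (') in the other. The sketch below only illustrates that check, using the standard library. `char_differences` is a hypothetical helper, and nothing in this thread confirms that a character mismatch was the cause.

```python
import unicodedata
from typing import List, Tuple


def char_differences(a: str, b: str) -> List[Tuple[int, str, str]]:
    """List positions where two strings differ, with Unicode character names.

    Compares position by position, so it is most useful for strings that
    are expected to be identical apart from a few look-alike characters.
    """
    diffs = []
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            diffs.append(
                (i, unicodedata.name(ca, repr(ca)), unicodedata.name(cb, repr(cb)))
            )
    if len(a) != len(b):
        # Report a length mismatch at the point where the shorter string ends.
        diffs.append((min(len(a), len(b)), f"length {len(a)}", f"length {len(b)}"))
    return diffs


if __name__ == "__main__":
    training_text = "US Treasury Department’s Office of Foreign Assets Control"
    notebook_text = "US Treasury Department's Office of Foreign Assets Control"
    for position, left, right in char_differences(training_text, notebook_text):
        print(position, left, right)
```

An empty result means the two strings are character-for-character identical, so the difference would have to come from somewhere else in the pipeline.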

Btw, there’s a variant of the missing entity, US Treasury’s Office of Foreign Assets Control, which is picked up perfectly in every tested sentence, which puzzles me even more.

Using the latest version of spaCy. Thanks.

Issue Analytics

Created 3 years ago
Comments: 18 (9 by maintainers)

Top GitHub Comments

vedtam commented, Oct 13, 2020

Hi @svlandeg, I appreciate your help so much! Yes, this makes sense and will help us in preparing our data in such a manner that’s consistent from training to production. Can’t wait to dive in and do some refactoring.

Thanks again!

svlandeg commented, Sep 2, 2020

Yes, I received it, thanks!