
Several problems with named entities predicted with the ner pipeline

See original GitHub issue

šŸ› Bug

Information

Hello,

I am using the dbmdz/bert-large-cased-finetuned-conll03-english model (with the bert-base-cased tokenizer) to predict named entities for a bunch of sentences (around 29 900). I am facing 3 main issues:

  1. Residual '##' in grouped entities' word field (so they are not well grouped)
  2. [UNK] (or [CLS]) tokens inside word fields
  3. Missing syllables in the word fields

Model I am using (Bert, XLNet …): Bert (dbmdz/bert-large-cased-finetuned-conll03-english)

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: NER with my own unlabelled dataset

To reproduce

I didn't find an official example for this, so I made my own script with the TokenClassificationPipeline:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import TokenClassificationPipeline

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

nlp_not_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=False
)

nlp_grouped = TokenClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    grouped_entities=True
)

seq1 = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

seq2 = "In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification ."

seq3 = "Product sales to the PSA Peugeot CitroƃĀ«n group totaled Ć¢ā€šĀ¬ 1 , 893 . 6 million in 2012 , down 8 . 1 % "\
"on a reported basis and 10 . 4 % on a like - for - like basis ."

seq4 = "To prepare as best as possible the decisions falling under its responsibilities , Faurecia Ć¢ā‚¬ā„¢ s Board of"\
" Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation"\
" Committee ."

sequences = [seq1, seq2, seq3, seq4]

for i, seq in enumerate(sequences):
    ngrouped, grouped = nlp_not_grouped(seq), nlp_grouped(seq)
    print(f"===================== sentence nĀ°{i+1}")
    print("---Sentence---")
    print(seq)
    print("---Not grouped entities---")
    for ngent in ngrouped:
        print(ngent)
    print("---Grouped entities---")
    for gent in grouped:
        print(gent)
    

I have about 29 900 sentences. For each sentence I want to predict all the named entities in it and then locate them in the sentence. Once I have an entity, I use a regex to find it in the original sentence (before the tokenization step) like this:

start, stop = re.search(re.escape(ent['word']), sent).span()

Where ent['word'] is the text of an entity found in a sentence; for instance, it can be "London" for the sentence (sent) "London is really a great city". I do this later with the grouped entities, but since there are errors in them, many are discarded because re.search() finds no match and the call to .span() raises an exception (that I catch).
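
For context, here is a minimal sketch of that lookup step (the helper name and the explicit None check are illustrative, they are not in my original script):

import re

def locate_entity(ent_word, sent):
    # Look for the predicted entity text in the original, pre-tokenization sentence.
    match = re.search(re.escape(ent_word), sent)
    if match is None:
        # This is what happens when the grouped word still contains '##' or '[UNK]':
        # the text no longer appears verbatim in the sentence, so the entity is discarded.
        return None
    return match.span()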

Steps to reproduce the behavior:

You just have to run my script to predict the entities for the four sentences. Here is what I get:

===================== sentence n°1
---Sentence---
Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore veryclose to the Manhattan Bridge.
---Not grouped entities---
{'word': 'Hu', 'score': 0.9995108246803284, 'entity': 'I-ORG', 'index': 1}
{'word': '##gging', 'score': 0.989597499370575, 'entity': 'I-ORG', 'index': 2}
{'word': 'Face', 'score': 0.9979704022407532, 'entity': 'I-ORG', 'index': 3}
{'word': 'Inc', 'score': 0.9993758797645569, 'entity': 'I-ORG', 'index': 4}
{'word': 'New', 'score': 0.9993405938148499, 'entity': 'I-LOC', 'index': 11}
{'word': 'York', 'score': 0.9991927742958069, 'entity': 'I-LOC', 'index': 12}
{'word': 'City', 'score': 0.9993411302566528, 'entity': 'I-LOC', 'index': 13}
{'word': 'D', 'score': 0.986336350440979, 'entity': 'I-LOC', 'index': 19}
{'word': '##UM', 'score': 0.9396238923072815, 'entity': 'I-LOC', 'index': 20}
{'word': '##BO', 'score': 0.9121386408805847, 'entity': 'I-LOC', 'index': 21}
{'word': 'Manhattan', 'score': 0.9839190244674683, 'entity': 'I-LOC', 'index': 29}
{'word': 'Bridge', 'score': 0.9924242496490479, 'entity': 'I-LOC', 'index': 30}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9966136515140533, 'word': 'Hugging Face Inc'}
{'entity_group': 'I-LOC', 'score': 0.9992914994557699, 'word': 'New York City'}
{'entity_group': 'I-LOC', 'score': 0.9460329612096151, 'word': 'DUMBO'}
{'entity_group': 'I-LOC', 'score': 0.9881716370582581, 'word': 'Manhattan Bridge'}
===================== sentence n°2
---Sentence---
In addition , the Blabla Group has completed the acquisition of ISO / TS16949 certification .
---Not grouped entities---
{'word': 'B', 'score': 0.9997261762619019, 'entity': 'I-ORG', 'index': 5}
{'word': '##la', 'score': 0.997683048248291, 'entity': 'I-ORG', 'index': 6}
{'word': '##bla', 'score': 0.99888014793396, 'entity': 'I-ORG', 'index': 7}
{'word': 'Group', 'score': 0.9992784261703491, 'entity': 'I-ORG', 'index': 8}
{'word': 'ISO', 'score': 0.9711909890174866, 'entity': 'I-MISC', 'index': 14}
{'word': 'T', 'score': 0.6591967344284058, 'entity': 'I-ORG', 'index': 16}
{'word': '##S', 'score': 0.658642053604126, 'entity': 'I-MISC', 'index': 17}
{'word': '##16', 'score': 0.5059574842453003, 'entity': 'I-MISC', 'index': 18}
{'word': '##9', 'score': 0.5067382454872131, 'entity': 'I-MISC', 'index': 21}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9988919496536255, 'word': 'Blabla Group'}
{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO'}
{'entity_group': 'I-ORG', 'score': 0.6591967344284058, 'word': 'T'}
{'entity_group': 'I-MISC', 'score': 0.5822997689247131, 'word': '##S16'}
===================== sentence n°3
---Sentence---
Product sales to the PSA Peugeot CitroÃ«n group totaled â‚¬ 1 , 893 . 6 million in 2012 , down 8 . 1 % on a reported basis and 10 . 4 % on a like - for - like basis .
---Not grouped entities---
{'word': 'PS', 'score': 0.9970256686210632, 'entity': 'I-ORG', 'index': 5}
{'word': '##A', 'score': 0.9927457571029663, 'entity': 'I-ORG', 'index': 6}
{'word': 'P', 'score': 0.9980151653289795, 'entity': 'I-ORG', 'index': 7}
{'word': '##eu', 'score': 0.9897757768630981, 'entity': 'I-ORG', 'index': 8}
{'word': '##ge', 'score': 0.996147871017456, 'entity': 'I-ORG', 'index': 9}
{'word': '##ot', 'score': 0.9928787350654602, 'entity': 'I-ORG', 'index': 10}
{'word': '[UNK]', 'score': 0.5744695067405701, 'entity': 'I-ORG', 'index': 11}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot [UNK]'}
===================== sentence n°4
---Sentence---
To prepare as best as possible the decisions falling under its responsibilities , Faurecia â€™ s Board of Directors has set up three committees : c Audit Committee ; c Strategy Committee ; c Appointments and Compensation Committee .
---Not grouped entities---
{'word': 'F', 'score': 0.9983997941017151, 'entity': 'I-ORG', 'index': 14}
{'word': '##au', 'score': 0.9473735690116882, 'entity': 'I-ORG', 'index': 15}
{'word': '##re', 'score': 0.9604568481445312, 'entity': 'I-ORG', 'index': 16}
{'word': '##cia', 'score': 0.992807149887085, 'entity': 'I-ORG', 'index': 17}
{'word': 'Board', 'score': 0.8452167510986328, 'entity': 'I-ORG', 'index': 20}
{'word': 'of', 'score': 0.5921975374221802, 'entity': 'I-ORG', 'index': 21}
{'word': 'Directors', 'score': 0.6778028607368469, 'entity': 'I-ORG', 'index': 22}
{'word': 'Audi', 'score': 0.9764850735664368, 'entity': 'I-ORG', 'index': 30}
{'word': '##t', 'score': 0.9692177772521973, 'entity': 'I-ORG', 'index': 31}
{'word': 'Committee', 'score': 0.9959701299667358, 'entity': 'I-ORG', 'index': 32}
{'word': 'Strategy', 'score': 0.9705951809883118, 'entity': 'I-ORG', 'index': 35}
{'word': 'Committee', 'score': 0.994032621383667, 'entity': 'I-ORG', 'index': 36}
{'word': 'A', 'score': 0.9764854907989502, 'entity': 'I-ORG', 'index': 39}
{'word': '##oint', 'score': 0.7803319692611694, 'entity': 'I-ORG', 'index': 41}
{'word': '##ments', 'score': 0.7828453779220581, 'entity': 'I-ORG', 'index': 42}
{'word': 'and', 'score': 0.9625542163848877, 'entity': 'I-ORG', 'index': 43}
{'word': 'Co', 'score': 0.9904180765151978, 'entity': 'I-ORG', 'index': 44}
{'word': '##mp', 'score': 0.9140805602073669, 'entity': 'I-ORG', 'index': 45}
{'word': '##ens', 'score': 0.8661588430404663, 'entity': 'I-ORG', 'index': 46}
{'word': '##ation', 'score': 0.9150537252426147, 'entity': 'I-ORG', 'index': 47}
{'word': 'Committee', 'score': 0.9888517260551453, 'entity': 'I-ORG', 'index': 48}
---Grouped entities---
{'entity_group': 'I-ORG', 'score': 0.9747593402862549, 'word': 'Faurecia'}
{'entity_group': 'I-ORG', 'score': 0.7050723830858866, 'word': 'Board of Directors'}
{'entity_group': 'I-ORG', 'score': 0.9805576602617899, 'word': 'Audit Committee'}
{'entity_group': 'I-ORG', 'score': 0.9823139011859894, 'word': 'Strategy Committee'}
{'entity_group': 'I-ORG', 'score': 0.9764854907989502, 'word': 'A'}
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': '##ointments and Compensation Committee'}

Expected behavior

For the first sentence (seq1) everything is fine. It's the example from the NER part of the Usage section of the documentation: https://huggingface.co/transformers/usage.html#named-entity-recognition

With the other sentences, we can see one example of each problem:

Residual '##' in word pieces

{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO'}
{'entity_group': 'I-ORG', 'score': 0.6591967344284058, 'word': 'T'}
{'entity_group': 'I-MISC', 'score': 0.5822997689247131, 'word': '##S16'}

In seq2, '##S16' appears as a word. Obviously, it should have been grouped with the preceding entity to form 'TS16', or maybe even 'ISO / TS16949', like this:

{'entity_group': 'I-MISC', 'score': 0.9711909890174866, 'word': 'ISO / TS16949'}
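
As a workaround while this is not fixed in the pipeline, the residual word pieces could be merged from the non-grouped output. A rough sketch (the merging heuristic below is my own, not what the pipeline does):

def merge_wordpieces(preds):
    # preds: the list returned by the pipeline with grouped_entities=False
    merged = []
    for p in preds:
        prev = merged[-1] if merged else None
        if p["word"].startswith("##") and prev is not None and p["index"] == prev["index"] + 1:
            # Glue the piece onto the previous prediction and remember its position.
            prev["word"] += p["word"][2:]
            prev["index"] = p["index"]
        else:
            merged.append(dict(p))
    return merged

Applied to the non-grouped output of seq2, this at least turns 'T' + '##S' + '##16' into 'TS16', although the detached '##9' (index 21) is still left over.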

[UNK] tokens in the word field

{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot [UNK]'}

This is probably due to the badly encoded CitroÃ«n, which stands for Citroën. The entity found is 'PSA Peugeot [UNK]'. In this case it would be better to just output 'PSA Peugeot' if the last token is identified as [UNK]:

{'entity_group': 'I-ORG', 'score': 0.934436925819942, 'word': 'PSA Peugeot'}
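
A simple post-processing step along those lines could look like this (purely illustrative, not something the pipeline does today; ent is one grouped-entity dict):

# Drop [UNK]/[CLS] pieces from a grouped entity before using it downstream.
cleaned = " ".join(piece for piece in ent["word"].split() if piece not in ("[UNK]", "[CLS]"))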

Syllables lost

For the last sentence, we can see that 'Appointments and Compensation Committee' has been split into:

{'entity_group': 'I-ORG', 'score': 0.9764854907989502, 'word': 'A'}
{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': '##ointments and Compensation Committee'}

instead of:

{'entity_group': 'I-ORG', 'score': 0.9000368118286133, 'word': 'Appointments and Compensation Committee'}

The entity is not well grouped, but more importantly the 'pp' is missing, so even if we decided to blend the two groups we wouldn't get the real entity. This problem was first raised here: #4816. I actually encountered it while trying to fix the first issue: I noticed that some entities grouped like this are missing syllables. The pipeline with grouped_entities=False has already lost the 'pp':

{'word': 'A', 'score': 0.9764854907989502, 'entity': 'I-ORG', 'index': 39}
{'word': '##oint', 'score': 0.7803319692611694, 'entity': 'I-ORG', 'index': 41}
{'word': '##ments', 'score': 0.7828453779220581, 'entity': 'I-ORG', 'index': 42}

It seems the way the pipeline blends the tokens is not right, because when I predict the label for each token with the code example from the documentation, I get this:

[('[CLS]', 'O'), ('To', 'O'), ('prepare', 'O'), ('as', 'O'), ('best', 'O'), ('as', 'I-ORG'), ('possible', 'I-ORG'), ('the', 'I-ORG'), ('decisions', 'I-ORG'), ('falling', 'I-ORG'), ('under', 'I-ORG'), ('its', 'I-ORG'), ('responsibilities', 'O'), (',', 'O'), ('F', 'O'), ('##au', 'O'), ('##re', 'O'), ('##cia', 'O'), ('[UNK]', 'O'), ('s', 'O'), ('Board', 'O'), ('of', 'O'), ('Directors', 'O'), ('has', 'O'), ('set', 'O'), ('up', 'O'), ('three', 'O'), ('committees', 'O'), (':', 'O'), ('c', 'O'), ('Audi', 'O'), ('##t', 'O'), ('Committee', 'O'), (';', 'O'), ('c', 'O'), ('Strategy', 'O'), ('Committee', 'O'), (';', 'O'), ('c', 'O'), ('A', 'O'), ('##pp', 'O'), ('##oint', 'O'), ('##ments', 'O'), ('and', 'O'), ('Co', 'O'), ('##mp', 'O'), ('##ens', 'O'), ('##ation', 'O'), ('Committee', 'O'), ('.', 'O'), ('[SEP]', 'O')]

Among them are these tokens:

('A', 'O'), ('##pp', 'O'), ('##oint', 'O'), ('##ments', 'O') for 'Appointments'
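
As a side note, a fast tokenizer can map every word piece back to a character span of the original text, which would make it possible to recover the full surface form ('Appointments') even when a piece such as '##pp' is dropped from the predictions. A sketch, assuming a recent transformers version and a fast tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

text = "c Appointments and Compensation Committee ."
enc = tokenizer(text, return_offsets_mapping=True)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    # Special tokens such as [CLS] and [SEP] get the empty span (0, 0).
    print(token, repr(text[start:end]))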

Environment info

  • transformers version: 2.11.0
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.5.0+cpu (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

EDIT: Typos

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

2 reactions
dav009 commented, Jun 17, 2020

An interesting finding:

Using a fast tokenizer solves the [UNK] issue. Using one of your provided examples:

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
nlp = TokenClassificationPipeline(model=model,
      tokenizer=tokenizer,
      grouped_entities=False)

t="Product sales to the PSA Peugeot CitroƃĀ«n group totaled Ć¢ā€šĀ¬ 1 , 893 . 6 million in 2012 , down 8 . 1 %  on a reported basis and 10 . 4 % on a like - for - like basis ."

nlp(t)
[{'word': 'PS', 'score': 0.9961145520210266, 'entity': 'I-ORG', 'index': 5},
 {'word': '##A', 'score': 0.9905584454536438, 'entity': 'I-ORG', 'index': 6},
 {'word': 'P', 'score': 0.997616708278656, 'entity': 'I-ORG', 'index': 7},
 {'word': '##eu', 'score': 0.9741767644882202, 'entity': 'I-ORG', 'index': 8},
 {'word': '##ge', 'score': 0.9928027391433716, 'entity': 'I-ORG', 'index': 9},
 {'word': '##ot', 'score': 0.9900722503662109, 'entity': 'I-ORG', 'index': 10},
 {'word': 'C', 'score': 0.9574489593505859, 'entity': 'I-ORG', 'index': 11},
 {'word': '##it', 'score': 0.824583113193512, 'entity': 'I-ORG', 'index': 12},
 {'word': '##ro', 'score': 0.7597800493240356, 'entity': 'I-ORG', 'index': 13},
 {'word': '##A', 'score': 0.953075647354126, 'entity': 'I-ORG', 'index': 14},
 {'word': '«', 'score': 0.6135829091072083, 'entity': 'I-ORG', 'index': 15}]

1 reaction
HHoofs commented, Aug 3, 2020

For sentence 4, this is because the ##pp in "Appointments" is not being tagged as an entity. This will require a separate PR that assumes that all the word pieces attached to a tagged entity token should also be tagged with the same entity, whether or not they were tagged.

Although I agree that it could be solved in a later PR, shouldn't this more 'holistic' view be preferable (and be the default)? If one token in a word is 'missed' but the other four are an entity (e.g. PER-PER-O-PER-PER), the whole word is an entity (and not two separate entities). We 'know' what the word level comprises; the model doesn't.
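
To illustrate the word-level view argued for here, something along these lines (my own sketch, not an existing transformers option) would re-attach the missed pieces:

def propagate_labels_to_wordpieces(token_label_pairs):
    # Group consecutive '##' pieces with the piece that starts the word,
    # then give every piece of the word the first non-'O' label found in it.
    words, current = [], []
    for token, label in token_label_pairs:
        if token.startswith("##") and current:
            current.append((token, label))
        else:
            if current:
                words.append(current)
            current = [(token, label)]
    if current:
        words.append(current)

    fixed = []
    for word in words:
        labels = [label for _, label in word if label != "O"]
        word_label = labels[0] if labels else "O"
        fixed.extend((token, word_label) for token, _ in word)
    return fixed

For example, if 'A', '##oint' and '##ments' are tagged I-ORG but '##pp' comes out as 'O', all four pieces end up tagged I-ORG and the whole word can be grouped.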
