NER pipeline: Inconsistent entity grouping
🐛 Bug
Information
Model I am using: `mrm8488/bert-spanish-cased-finetuned-ner`
Language I am using the model on (English, Chinese …): Spanish
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- create a `ner` pipeline
- pass the flag `grouped_entities=True`
- entities are not grouped as expected; see the sample below
```python
from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"

nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))

t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""

nlp_ner(t)
```
Output:

```
[{'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'},
 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'},
 {'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'},
 {'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'},
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]
```
Expected behavior
Inconsistent grouping
I expect the first two items of the given sample (`B-PER` and `I-PER`) to be grouped, since they are contiguous tokens and correspond to a single entity span. It seems the current code does not take the `B-` and `I-` prefixes into account.
Expected output:

```
[{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Consuelo Araújo Noguera'},
 {'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'},
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
```
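Something like the following post-processing produces the grouping I expect. This is only a minimal sketch: `merge_bio_groups` is a hypothetical helper (not part of transformers), and it reuses `nlp_ner` and `t` from the snippet above.

```python
# Hypothetical helper: merge an adjacent B-X group with the I-X group(s)
# that immediately follow it, so one entity span becomes one group.
def merge_bio_groups(groups):
    merged = []
    for group in groups:
        prefix, _, etype = group["entity_group"].partition("-")
        if merged:
            _, _, prev_type = merged[-1]["entity_group"].partition("-")
            if prefix == "I" and etype == prev_type:
                merged[-1]["word"] += " " + group["word"]
                # keep the lower score as a conservative confidence for the span
                merged[-1]["score"] = min(merged[-1]["score"], group["score"])
                continue
        merged.append(dict(group))
    return merged

print(merge_bio_groups(nlp_ner(t)))
# yields one group for 'Consuelo Araújo Noguera' and one for 'Andrés Pastrana'
# (the dropped '##c' piece of 'Farc' is a separate problem, see below)
```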
Lost tokens?
For the same input, passing `grouped_entities=False` generates the following output:
```
[{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
 {'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2},
 {'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3},
 {'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4},
 {'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5},
 {'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6},
 {'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7},
 {'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15},
 {'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16},
 {'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17},
 {'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18},
 {'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28},
 {'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]
```
When using `grouped_entities=True`, the last entity word piece (`##c`) gets lost; it is not even returned as a separate group:

```
{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}
```
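For illustration, re-attaching `##` word pieces to the preceding token, regardless of their entity label, would keep `##c`. Again a rough sketch: `merge_word_pieces` is a hypothetical helper (not library behavior), reusing `pipeline`, `NER_MODEL`, and `t` from the repro snippet above.

```python
# Hypothetical sketch: glue '##' word pieces back onto the preceding token
# regardless of entity label, so pieces like '##c' are not dropped.
def merge_word_pieces(tokens):
    merged = []
    for tok in tokens:
        is_piece = tok["word"].startswith("##")
        # only attach to the directly preceding token; a gap in 'index' means
        # the previous piece was filtered out (e.g. an 'O'-labelled token)
        follows_prev = bool(merged) and tok["index"] == merged[-1]["index"] + 1
        if is_piece and follows_prev:
            merged[-1]["word"] += tok["word"][2:]
            merged[-1]["score"] = min(merged[-1]["score"], tok["score"])
            merged[-1]["index"] = tok["index"]  # so a further '##' piece still chains
        else:
            merged.append(dict(tok))
    return merged

tokens = pipeline("ner", model=NER_MODEL,
                  tokenizer=(NER_MODEL, {"use_fast": False}))(t)
print(merge_word_pieces(tokens))
# 'Far' + '##c' becomes 'Farc', keeping the first piece's label (B-ORG here)
```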
Environment info
- `transformers` version: 2.11.0
- Platform: OSX
- Python version: 3.7
- PyTorch version (GPU?): 1.5.0
- Tensorflow version (GPU?):
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Top GitHub Comments
@dav009 Thanks for posting this issue! `B` and `I` tokens are not yet considered. Will have to include this in a new PR.

There is an `ignore_labels` argument for `TokenClassificationPipeline`, which is set to `["O"]` by default. If you don't want to skip any token, you can just set `ignore_labels=[]`.
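If I read that correctly, the argument can be forwarded through the `pipeline()` factory. A minimal usage sketch along those lines, reusing `pipeline`, `NER_MODEL`, and `t` from the report above (untested against 2.11.0):

```python
# Sketch: keep every predicted token, including those labelled 'O',
# by overriding the default ignore_labels=["O"].
nlp_keep_all = pipeline("ner", model=NER_MODEL,
                        tokenizer=(NER_MODEL, {"use_fast": False}),
                        ignore_labels=[])
print(nlp_keep_all(t))  # 'O' tokens are now included in the output
```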
I'm happy to work on (1) within the next week or so, since I've already been planning to apply this fix.

@Nighthyst I see, you're bringing up a different issue now. This is the case where the entity type of one of a word's word pieces differs from the other word pieces. A fix I can apply here is to automatically group word pieces together regardless of entity type. I can apply this in a new PR after merging the existing one.