NER pipeline: Inconsistent entity grouping

🐛 Bug

Information

“mrm8488/bert-spanish-cased-finetuned-ner”

Language I am using the model on (English, Chinese …): Spanish

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Create a NER pipeline.
Pass the flag grouped_entities=True.
Entities are not grouped as expected; see the sample below.

from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
# Build the NER pipeline with the slow tokenizer and entity grouping enabled
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))

t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
nlp_ner(t)
>>> 
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'}, 
 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'}, 
 {'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'}, 
 {'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Expected behavior

Inconsistent grouping

I expect the first two items of the given sample (B-PER and I-PER) to be grouped, as they are contiguous tokens and correspond to a single entity. It seems the current code does not take the B and I prefixes into account.

expected output:

[{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Consuelo Araújo Noguera'}, 
 {'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
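
The grouping requested here can be sketched in plain Python. This is only an illustration, not the code used by transformers: merge_bio_groups is a hypothetical helper that post-processes the grouped output shown earlier, starting a new entity at every B- label and folding each following I- label of the same type into it. It assumes consecutive list items are adjacent in the text (the grouped output carries no token indices), reports the bare entity type instead of a B-/I- label, and averages the scores of the merged items; all three are choices made for this sketch.

```python
def merge_bio_groups(groups):
    """Merge each B-X group with the I-X groups that directly follow it."""
    merged = []
    for group in groups:
        # "B-PER" -> prefix "B", entity_type "PER"
        prefix, _, entity_type = group["entity_group"].partition("-")
        previous = merged[-1] if merged else None
        if prefix == "I" and previous is not None and previous["entity_group"] == entity_type:
            # Continuation of the entity that is currently open
            previous["word"] += " " + group["word"]
            previous["scores"].append(group["score"])
        else:
            # A B- label (or an I- label with nothing to attach to) opens a new entity
            merged.append({"entity_group": entity_type,
                           "word": group["word"],
                           "scores": [group["score"]]})
    return [{"entity_group": m["entity_group"],
             "score": sum(m["scores"]) / len(m["scores"]),
             "word": m["word"]} for m in merged]


# Grouped output reported in this issue
groups = [
    {"entity_group": "B-PER", "score": 0.901019960641861, "word": "Consuelo"},
    {"entity_group": "I-PER", "score": 0.9990904808044434, "word": "Araújo Noguera"},
    {"entity_group": "B-PER", "score": 0.9998136162757874, "word": "Andrés"},
    {"entity_group": "I-PER", "score": 0.9996985991795858, "word": "Pastrana"},
    {"entity_group": "B-ORG", "score": 0.9989739060401917, "word": "Far"},
]
result = merge_bio_groups(groups)
for entity in result:
    print(entity["entity_group"], entity["word"])
```

The truncated 'Far' is left as it is, since the missing word piece never reaches this step.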

Lost tokens?

For the same input, passing grouped_entities=False generates the following output:

[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2}, 
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3}, 
{'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4}, 
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5}, 
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6}, 
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7}, 
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15}, 
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16}, 
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17}, 
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18}, 
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28}, 
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]

When using grouped_entities=True, the last word piece of the entity (##c) is lost; it is not even returned as a separate group:

{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Environment info

transformers version: 2.11.0
Platform: OSX
Python version: 3.7
PyTorch version (GPU?): 1.5.0
Tensorflow version (GPU?):
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Issue Analytics

State:
Created 3 years ago
Comments:19 (9 by maintainers)

Top GitHub Comments

enzoampil commented, Jun 6, 2020

@dav009 Thanks for posting this issue!

1. Inconsistent grouping - correct that B and I tokens are not yet considered. Will have to include this in a new PR.
2. Lost tokens - the skipped tokens are those with an entity type found in the ignore_labels argument for TokenClassificationPipeline, which is set as ["O"] by default. If you don’t want to skip any token, you can just set ignore_labels=[].

I’m happy to work on 1 within the next week or so since I’ve already been planning to apply this fix.
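
As a rough illustration of what the ignore_labels filter described above does, here is a minimal sketch in plain Python. It is not the pipeline's code: drop_ignored is a hypothetical helper, and the token list is made up for the example rather than taken from real model output.

```python
def drop_ignored(tokens, ignore_labels=("O",)):
    """Keep only the tokens whose predicted label is not in ignore_labels."""
    ignored = set(ignore_labels)
    return [token for token in tokens if token["entity"] not in ignored]


# Made-up predictions, for illustration only
tokens = [
    {"word": "ministra", "entity": "O"},
    {"word": "Andrés", "entity": "B-PER"},
    {"word": "Pastrana", "entity": "I-PER"},
]

kept_default = drop_ignored(tokens)                # mirrors the default ["O"]
kept_all = drop_ignored(tokens, ignore_labels=[])  # nothing is skipped
print([token["word"] for token in kept_default])
print([token["word"] for token in kept_all])
```

With the real pipeline, the equivalent setting is the ignore_labels=[] argument mentioned in the comment.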

enzoampil commented, Jun 16, 2020

@Nighthyst I see, you’re bringing up a different issue now. This is the case where the entity type of one of a word’s word pieces is different from that of its other word pieces.

A fix I can apply here is to automatically group word pieces together regardless of entity type. I can apply this to a new PR after merging this existing one.
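
One way to picture that fix is the sketch below. It is an assumption-laden illustration, not the code from the PR: merge_word_pieces is a hypothetical helper that works on the ungrouped output shown in the issue, glues every "##" piece onto the word before it when their indices are consecutive, keeps the label of the first piece, and averages the piece scores. The last two choices are assumptions made for this example.

```python
def merge_word_pieces(tokens):
    """Glue '##' continuation pieces onto the preceding word, whatever their labels."""
    words = []
    for token in tokens:
        previous = words[-1] if words else None
        is_continuation = token["word"].startswith("##")
        if (is_continuation and previous is not None
                and token["index"] == previous["last_index"] + 1):
            previous["word"] += token["word"][2:]  # strip the '##' marker
            previous["scores"].append(token["score"])
            previous["last_index"] = token["index"]
        else:
            words.append({"word": token["word"],
                          "entity": token["entity"],  # label of the first piece
                          "scores": [token["score"]],
                          "last_index": token["index"]})
    return [{"word": w["word"],
             "entity": w["entity"],
             "score": sum(w["scores"]) / len(w["scores"])} for w in words]


# Subset of the ungrouped output reported in this issue
tokens = [
    {"word": "Andrés", "score": 0.9998136162757874, "entity": "B-PER", "index": 15},
    {"word": "Pas", "score": 0.999740719795227, "entity": "I-PER", "index": 16},
    {"word": "##tran", "score": 0.9997414350509644, "entity": "I-PER", "index": 17},
    {"word": "##a", "score": 0.9996136426925659, "entity": "I-PER", "index": 18},
    {"word": "Far", "score": 0.9989739060401917, "entity": "B-ORG", "index": 28},
    {"word": "##c", "score": 0.7188423275947571, "entity": "I-ORG", "index": 29},
]
words = merge_word_pieces(tokens)
for word in words:
    print(word["entity"], word["word"])
```

Here 'Far' and '##c' end up in one word, 'Farc', even though their labels (B-ORG and I-ORG) differ.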
