NER pipeline: Inconsistent entity grouping

🐛 Bug

Information

Model I am using: “mrm8488/bert-spanish-cased-finetuned-ner”

Language I am using the model on (English, Chinese …): Spanish

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Create an NER pipeline.
  2. Pass the flag grouped_entities=True.
  3. The entities are not grouped as expected; see the sample below.

from transformers import pipeline

NER_MODEL = "mrm8488/bert-spanish-cased-finetuned-ner"
# Build an NER pipeline with entity grouping enabled and the slow (Python) tokenizer
nlp_ner = pipeline("ner", model=NER_MODEL,
                   grouped_entities=True,
                   tokenizer=(NER_MODEL, {"use_fast": False}))

t = """Consuelo Araújo Noguera, ministra de cultura del presidente Andrés Pastrana (1998.2002) fue asesinada por las Farc luego de haber permanecido secuestrada por algunos meses."""
# Run the pipeline on the sample sentence
nlp_ner(t)
>>> 
[ {'entity_group': 'B-PER', 'score': 0.901019960641861, 'word': 'Consuelo'}, 
 {'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Araújo Noguera'}, 
 {'entity_group': 'B-PER', 'score': 0.9998136162757874, 'word': 'Andrés'}, 
 {'entity_group': 'I-PER', 'score': 0.9996985991795858, 'word': 'Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Expected behavior

Inconsistent grouping

I expect the first two items of the sample above (B-PER and I-PER) to be grouped, since they are contiguous tokens that belong to a single entity. It seems the current code does not take the B and I prefixes into account.

Expected output:

[{'entity_group': 'I-PER', 'score': 0.9990904808044434, 'word': 'Consuelo Araújo Noguera'}, 
 {'entity_group': 'I-PER', 'score': 0.9998136162757874, 'word': 'Andrés Pastrana'}, 
 {'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Farc'}]
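
The grouping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the transformers implementation: group_bio_entities is a hypothetical helper, it expects token dicts with the word, entity, and index keys that the pipeline returns when grouped_entities=False, it reports the bare entity type (PER, ORG) rather than a B-/I- label, and it leaves score aggregation out.

```python
def group_bio_entities(tokens):
    """Merge adjacent tokens of one entity type into a single group.

    Hypothetical helper, not part of transformers. Each token is a dict
    with 'word', 'entity' (e.g. 'B-PER') and 'index' keys.
    """
    groups = []
    prev_index = None
    for tok in tokens:
        prefix, _, etype = tok["entity"].partition("-")
        if not etype:
            # Untyped tags such as "O" close any open group.
            prev_index = None
            continue
        is_piece = tok["word"].startswith("##")
        adjacent = prev_index is not None and tok["index"] == prev_index + 1
        same_type = bool(groups) and groups[-1]["entity_group"] == etype
        text = tok["word"][2:] if is_piece else tok["word"]
        # Continue the open group for I- tags and for word-piece continuations.
        if adjacent and same_type and (prefix == "I" or is_piece):
            groups[-1]["word"] += text if is_piece else " " + text
        else:
            groups.append({"entity_group": etype, "word": text})
        prev_index = tok["index"]
    return groups


# (word, entity, index) triples copied from the ungrouped output in this issue.
sample = [
    ("Cons", "B-PER", 1), ("##uelo", "B-PER", 2), ("Ara", "I-PER", 3),
    ("##új", "I-PER", 4), ("##o", "I-PER", 5), ("No", "I-PER", 6),
    ("##guera", "I-PER", 7), ("Andrés", "B-PER", 15), ("Pas", "I-PER", 16),
    ("##tran", "I-PER", 17), ("##a", "I-PER", 18), ("Far", "B-ORG", 28),
    ("##c", "I-ORG", 29),
]
tokens = [{"word": w, "entity": e, "index": i} for w, e, i in sample]
grouped = group_bio_entities(tokens)
for group in grouped:
    print(group)
```

With the tokens from this issue, the sketch yields three groups: Consuelo Araújo Noguera (PER), Andrés Pastrana (PER), and Farc (ORG).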

Lost tokens?

For the same input, passing grouped_entities=False generates the following output:

[
{'word': 'Cons', 'score': 0.9994944930076599, 'entity': 'B-PER', 'index': 1},
{'word': '##uelo', 'score': 0.802545428276062, 'entity': 'B-PER', 'index': 2}, 
{'word': 'Ara', 'score': 0.9993102550506592, 'entity': 'I-PER', 'index': 3}, 
{'word': '##új', 'score': 0.9993743896484375, 'entity': 'I-PER', 'index': 4}, 
{'word': '##o', 'score': 0.9992871880531311, 'entity': 'I-PER', 'index': 5}, 
{'word': 'No', 'score': 0.9993029236793518, 'entity': 'I-PER', 'index': 6}, 
{'word': '##guera', 'score': 0.9981776475906372, 'entity': 'I-PER', 'index': 7}, 
{'word': 'Andrés', 'score': 0.9998136162757874, 'entity': 'B-PER', 'index': 15}, 
{'word': 'Pas', 'score': 0.999740719795227, 'entity': 'I-PER', 'index': 16}, 
{'word': '##tran', 'score': 0.9997414350509644, 'entity': 'I-PER', 'index': 17}, 
{'word': '##a', 'score': 0.9996136426925659, 'entity': 'I-PER', 'index': 18}, 
{'word': 'Far', 'score': 0.9989739060401917, 'entity': 'B-ORG', 'index': 28}, 
{'word': '##c', 'score': 0.7188423275947571, 'entity': 'I-ORG', 'index': 29}]

With grouped_entities=True, the last word piece of the final entity (##c) is lost. It is not even reported as a separate group:

{'entity_group': 'B-ORG', 'score': 0.9989739060401917, 'word': 'Far'}]

Environment info

  • transformers version: 2.11.0
  • Platform: OSX
  • Python version: 3.7
  • PyTorch version (GPU?): 1.5.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 19 (9 by maintainers)

Top GitHub Comments

4 reactions
enzoampil commented, Jun 6, 2020

@dav009 Thanks for posting this issue!

  1. Inconsistent grouping - correct, the B and I prefixes are not yet considered. I will have to include this in a new PR.
  2. Lost tokens - the skipped tokens are those with an entity type found in the ignore_labels argument for TokenClassificationPipeline, which is set as ["O"] by default. If you don’t want to skip any token, you can just set ignore_labels=[].
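
The effect of ignore_labels described in point 2 can be illustrated without loading the model. The sketch below is an assumption-laden stand-in, not library code: filter_ignored is a hypothetical helper that mimics dropping tokens whose label is in ignore_labels, and the "O" tag on "las" is illustrative rather than taken from real model output. With the real pipeline, the suggestion above amounts to passing ignore_labels=[] when constructing it.

```python
def filter_ignored(tokens, ignore_labels=("O",)):
    """Drop tokens whose entity label is listed in ignore_labels.

    Hypothetical helper that only mimics the filtering described above.
    """
    return [tok for tok in tokens if tok["entity"] not in ignore_labels]


# Illustrative tokens: the "O" label on "las" is assumed, not model output.
example_tokens = [
    {"word": "las", "entity": "O"},
    {"word": "Far", "entity": "B-ORG"},
    {"word": "##c", "entity": "I-ORG"},
]

kept_by_default = filter_ignored(example_tokens)                  # "O" tokens dropped
kept_everything = filter_ignored(example_tokens, ignore_labels=[])  # nothing dropped

print([tok["word"] for tok in kept_by_default])
print([tok["word"] for tok in kept_everything])
```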

I’m happy to work on 1 within the next week or so since I’ve already been planning to apply this fix.

2 reactions
enzoampil commented, Jun 16, 2020

@Nighthyst I see, you’re bringing up a different issue now. This is the case where the entity type of one of a word’s word pieces differs from that of its other word pieces.

A fix I can apply here is to automatically group word pieces together regardless of entity type. I can apply this to a new PR after merging this existing one.
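
One way to picture the proposed fix is the sketch below. It is only an assumption about how such grouping could work, not the code from the PR: merge_word_pieces is a hypothetical helper, it assumes no tokens were dropped between a word's pieces, and it arbitrarily gives the merged word the label of its first piece, a choice the comment above does not specify.

```python
def merge_word_pieces(tokens):
    """Glue '##' continuation pieces onto the preceding word, ignoring their labels.

    Hypothetical helper. The merged word keeps the label of its first
    piece, which is an assumption made for this sketch.
    """
    words = []
    for tok in tokens:
        if tok["word"].startswith("##") and words:
            # Continuation piece: append its text and ignore its entity label.
            words[-1]["word"] += tok["word"][2:]
        else:
            words.append({"word": tok["word"], "entity": tok["entity"]})
    return words


# Word pieces from this issue whose labels disagree (B-ORG vs. I-ORG).
pieces = [
    {"word": "Far", "entity": "B-ORG"},
    {"word": "##c", "entity": "I-ORG"},
]
merged = merge_word_pieces(pieces)
print(merged)
```

Taking the label of the highest-scoring piece would be an equally plausible choice; the first-piece rule is used here only because it is the simplest to read.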
