XLM-RoBERTa NER extraction breaks/splits words!
I have been using the Hugging Face xlm-roberta-large-finetuned-conll03-english model's NER pipeline to extract name, location, and organization entities.
But now and then I face an issue with entity extraction from short sentences, where a word is broken into sub-word tokens that are assigned different entity types. The code I used is below:
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
ner_model = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

text = "Brennan Nov2018"
ner_model(text)
```
Output:
```json
[
  { "entity_group": "PER", "score": 0.6225427985191345, "word": "Brenn", "start": 0, "end": 5 },
  { "entity_group": "LOC", "score": 0.759472668170929, "word": "an", "start": 5, "end": 7 }
]
```
Even though I'm using `grouped_entities=True`, some words are still broken into two different entity groups.
Is there a way to prevent this and return only complete words as entities?
- PyTorch Version : 1.7.1
- transformers : 4.6.0
- Python : 3.8.5
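One downstream workaround (assuming the pipeline output format shown in the example above) is to merge adjacent entity groups whenever they split a single word, i.e. the next span starts exactly where the previous one ends, keeping the label of the higher-scoring piece. A minimal sketch, not an official pipeline feature:

```python
def merge_split_words(entities):
    """Merge adjacent entity spans that split one word (next start == previous end),
    keeping the entity_group of the higher-scoring piece."""
    merged = []
    for ent in entities:
        if merged and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            # Keep the label of whichever fragment the model scored higher.
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
            prev["word"] += ent["word"]
            prev["end"] = ent["end"]
        else:
            merged.append(dict(ent))
    return merged

entities = [
    {"entity_group": "PER", "score": 0.6225, "word": "Brenn", "start": 0, "end": 5},
    {"entity_group": "LOC", "score": 0.7595, "word": "an", "start": 5, "end": 7},
]
print(merge_split_words(entities))
# Yields a single "Brennan" span, labelled with the higher-scoring fragment's group.
```

Note that "higher score wins" is just one heuristic; averaging the scores per group, or taking the first fragment's label, are equally reasonable choices (and are what the pipeline's later aggregation strategies do).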
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@Narsil Could you advise whether there is a model on the Hugging Face Hub that is "word-aware"? I am not sure I understand it properly, but in my mind, none of the BERT models are actually "word-aware".
I struggled with this problem early last year and searched a lot online without finding a solution. I ended up with an ugly patch downstream to absorb the problem. So thanks for making some improvements to the pipelines.
Hi @ninjalu,
Do you mind explaining a little more what your issue is? Without context it's a bit hard to guide you correctly.
"Word-aware" tokenizers are the ones with `continuing_subword_prefix` set (the `tokenizer.backend_tokenizer.model.continuing_subword_prefix` attribute, if it exists). But most likely you shouldn't choose a tokenizer purely on that basis; consider first things like what data it was trained on and how much you can leverage the underlying model (if you're fine-tuning, for instance, it's better to pick a good model for your target data/language than to train the whole model + tokenizer from scratch).
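To illustrate where that attribute lives, a tiny WordPiece model from the `tokenizers` library can be built locally (the toy vocabulary below is made up purely for demonstration; no model download is needed):

```python
from tokenizers.models import WordPiece

# Toy vocabulary: the "##" prefix marks continuing sub-word pieces,
# which is what makes a WordPiece tokenizer "word-aware".
vocab = {"[UNK]": 0, "brenn": 1, "##an": 2}
model = WordPiece(vocab, unk_token="[UNK]")

print(model.continuing_subword_prefix)  # "##"
```

Byte-level BPE tokenizers like XLM-RoBERTa's do not set such a prefix, which is why the pipeline cannot always tell where one word ends and the next begins.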