XLM-RoBERTa NER extraction breaks/splits words!

I have been using the Hugging Face xlm-roberta-large-finetuned-conll03-english model in an NER pipeline to extract name, location, and organization entities.

But now and then I run into an issue with entity extraction from short sentences, where a single word is broken into sub-word tokens that are assigned different entity types. The code I used is below:

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
  
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")

ner_model = pipeline("ner", model = model, tokenizer = tokenizer, grouped_entities = True)

text = "Brennan Nov2018"
ner_model(text)

output: [ { "entity_group": "PER", "score": 0.6225427985191345, "word": "Brenn", "start": 0, "end": 5 }, { "entity_group": "LOC", "score": 0.759472668170929, "word": "an", "start": 5, "end": 7 } ]

Even though I’m using grouped_entities = True, some words are still broken into two different entity groups.

Is there a way to prevent this from happening and to return only complete words as entities?

  • PyTorch Version : 1.7.1
  • transformers : 4.6.0
  • Python : 3.8.5

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
ninjalu commented, Mar 18, 2022

@Narsil Could you advise if there is a model on HuggingFace hub that is “word-aware”? I am not sure if I understand it properly, but in my mind, none of the BERT models are actually “word-aware”.

I struggled with this problem early last year and searched online a lot without finding a solution. I ended up with an ugly downstream patch to absorb the problem. So thanks for making some improvements to the pipelines.
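A downstream patch of the kind mentioned above might look like the following sketch. It is not the commenter's actual patch: merge_split_words is a hypothetical helper, and keeping the label and score of the first piece of a word is an assumed design choice (keeping the highest-scoring piece would be just as defensible). It assumes each entity dict carries the start/end character offsets shown in the issue's output. Later transformers releases also added an aggregation_strategy argument to this pipeline, which may make such a patch unnecessary.

```python
from typing import Dict, List


def merge_split_words(text: str, entities: List[Dict]) -> List[Dict]:
    """Merge entity spans that touch with no gap, so a word split into
    several differently-labelled pieces comes back as a single span.

    Assumption: the merged span keeps the label and score of its first piece.
    """
    merged: List[Dict] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        # A span that starts exactly where the previous one ends is treated
        # as a continuation of the same word.
        if prev is not None and ent["start"] == prev["end"]:
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


# Example using the output reported in the issue.
text = "Brennan Nov2018"
entities = [
    {"entity_group": "PER", "score": 0.6225427985191345, "word": "Brenn", "start": 0, "end": 5},
    {"entity_group": "LOC", "score": 0.759472668170929, "word": "an", "start": 5, "end": 7},
]
merged = merge_split_words(text, entities)
print(merged)
```

Two entities separated by a space are left alone, because the second one does not start at the offset where the first one ends.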

0 reactions
Narsil commented, Mar 21, 2022

Hi @ninjalu,

Do you mind explaining a little more what your issue is? Without context it’s a bit hard to guide you correctly.

“Word-aware” tokenizers are the ones with continuing_subword_prefix set (the tokenizer.backend_tokenizer.model.continuing_subword_prefix variable, if it exists). But most likely you shouldn’t choose a tokenizer based purely on this. Consider first what data it was trained on and how much you can leverage the underlying model (if you’re fine-tuning, for instance, it’s better to pick a good model for your target data/language than to start the whole model and tokenizer from scratch).
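The check described above can be sketched as a small helper. The function name get_continuing_subword_prefix is hypothetical, and treating an empty prefix as "not set" is an assumption. The attribute path is the one quoted in the comment. The SimpleNamespace objects are stand-ins so that the snippet runs without downloading a model; with a real fast tokenizer you would pass the tokenizer object itself.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_continuing_subword_prefix(tokenizer: Any) -> Optional[str]:
    """Return tokenizer.backend_tokenizer.model.continuing_subword_prefix,
    or None when any part of that path is missing or the prefix is empty."""
    backend = getattr(tokenizer, "backend_tokenizer", None)
    model = getattr(backend, "model", None)
    prefix = getattr(model, "continuing_subword_prefix", None)
    return prefix or None  # assumption: an empty prefix counts as "not set"


# Stand-in mimicking a tokenizer whose model marks word continuations,
# such as the "##" marker used by WordPiece-style vocabularies.
with_prefix = SimpleNamespace(
    backend_tokenizer=SimpleNamespace(
        model=SimpleNamespace(continuing_subword_prefix="##")
    )
)
# Stand-in mimicking a tokenizer whose model does not expose the attribute.
without_prefix = SimpleNamespace(
    backend_tokenizer=SimpleNamespace(model=SimpleNamespace())
)

print(get_continuing_subword_prefix(with_prefix))
print(get_continuing_subword_prefix(without_prefix))
```

As the comment notes, this only tells you how the tokenizer marks sub-words; it is not a reason on its own to pick one model over another.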

Top Results From Across the Web

XLM-RoBERTa - Hugging Face
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks....

Chapter 4. Multilingual Named Entity Recognition - O'Reilly
The role of the model is to split the words into subwords to reduce the size of the vocabulary and try to reduce...

Named Entity Recognition with Transformers - Chris Tran
The script below will split sentences longer than MAX_LENGTH (in terms of tokens) into small ones. Otherwise, long sentences will be truncated ...

Named Entity Extraction Workflow with | ArcGIS API for Python
Some models like BERT , RoBERTa , XLNET , XLM-RoBERTa are highly accurate but at ... This is achieved by extracting word features...

Low-Resource Named Entity Recognition via the Pre-Training ...
Named entity recognition (NER) is an important task in the processing of ... words in low-resource languages from being split at the character...
