XLM-RoBERTa NER extraction breaks/splits words!
I have been using the Hugging Face xlm-roberta-large-finetuned-conll03-english model's NER pipeline to extract name, location, and organization entities.
But now and then I face an issue with entity extraction from short sentences, where a word is broken into sub-word tokens that are assigned different entity types. The code I used is below:
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
ner_model = pipeline("ner", model=model, tokenizer=tokenizer, grouped_entities=True)

text = "Brennan Nov2018"
ner_model(text)
```
Output:
```json
[
  { "entity_group": "PER", "score": 0.6225427985191345, "word": "Brenn", "start": 0, "end": 5 },
  { "entity_group": "LOC", "score": 0.759472668170929, "word": "an", "start": 5, "end": 7 }
]
```
Even though I'm using `grouped_entities=True`, some words are still broken into two different entity groups.
Is there a way to prevent this and return only complete words as entities?
- PyTorch Version : 1.7.1
- transformers : 4.6.0
- Python : 3.8.5
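One downstream workaround (assuming the pipeline output format shown in the example above) is to merge adjacent entity groups whenever they split a single word, i.e. the next span starts exactly where the previous one ends, keeping the label of the higher-scoring piece. A minimal sketch, not an official pipeline feature:

```python
def merge_split_words(entities):
    """Merge adjacent entity spans that split one word (next start == previous end),
    keeping the entity_group of the higher-scoring piece."""
    merged = []
    for ent in entities:
        if merged and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            # Keep the label of whichever fragment the model scored higher.
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
            prev["word"] += ent["word"]
            prev["end"] = ent["end"]
        else:
            merged.append(dict(ent))
    return merged

entities = [
    {"entity_group": "PER", "score": 0.6225, "word": "Brenn", "start": 0, "end": 5},
    {"entity_group": "LOC", "score": 0.7595, "word": "an", "start": 5, "end": 7},
]
print(merge_split_words(entities))
# Yields a single "Brennan" span, labelled with the higher-scoring fragment's group.
```

Note that "higher score wins" is just one heuristic; averaging the scores per group, or taking the first fragment's label, are equally reasonable choices (and are what the pipeline's later aggregation strategies do).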
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@Narsil Could you advise whether there is a model on the Hugging Face Hub that is "word-aware"? I am not sure I understand it properly, but in my mind, none of the BERT models are actually "word-aware".
I struggled with this problem early last year and searched a lot online without finding a solution. I ended up with an ugly patch downstream to absorb the problem. So thanks for making some improvements to the pipelines.
Hi @ninjalu,
Do you mind explaining a little more what your issue is? Without context it's a bit hard to guide you correctly.
"Word-aware" tokenizers are the ones with `continuing_subword_prefix` set (the `tokenizer.backend_tokenizer.model.continuing_subword_prefix` attribute, if it exists). But most likely you shouldn't choose a tokenizer purely on that basis; consider first things like what data it was trained on and how much you can leverage the underlying model (if you're fine-tuning, for instance, it's better to pick a good model for your target data/language than to train the whole model + tokenizer from scratch).
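To illustrate where that attribute lives, a tiny WordPiece model from the `tokenizers` library can be built locally (the toy vocabulary below is made up purely for demonstration; no model download is needed):

```python
from tokenizers.models import WordPiece

# Toy vocabulary: the "##" prefix marks continuing sub-word pieces,
# which is what makes a WordPiece tokenizer "word-aware".
vocab = {"[UNK]": 0, "brenn": 1, "##an": 2}
model = WordPiece(vocab, unk_token="[UNK]")

print(model.continuing_subword_prefix)  # "##"
```

Byte-level BPE tokenizers like XLM-RoBERTa's do not set such a prefix, which is why the pipeline cannot always tell where one word ends and the next begins.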