NER label re-alignment always expects B-labelled first sub-words
Environment info
- transformers version: 4.3.1
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.7.7
- PyTorch version (GPU?): 1.7.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
- bert, tokenizers, pipelines: @LysandreJik
- trainer, maintained examples: @sgugger
Information
Model I am using (Bert, XLNet …): DistilBERT fine-tuned for conll03
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Fine-tune a BERT model for NER/conll03 using the run_ner.py example script, all default values
- Correct the label alignments, see config.json
- Infer using entities that have not been seen at training time, and are composed of multiple word-parts as defined by WordPiece (my assumption as to the cause).
- Sub-words are labelled, but the pipeline re-grouping/label alignment relies on perfect sub-word labelling:
E.g. Accenture → A ##cc ##ent ##ure → B-ORG O O O → A (ORG)
E.g. Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Musterman (PER)
E.g. Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → ##sea (MISC)
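To make the failure mode concrete, here is a minimal pure-Python sketch of grouping that relies on perfect sub-word labelling, as described above. The function names and logic are illustrative assumptions, not the actual pipeline internals: it keeps only contiguous labelled sub-words, so a word whose later sub-words were labelled O gets truncated.

```python
def merge(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the
    previous piece (hypothetical helper, not the pipeline's)."""
    text = tokens[0]
    for tok in tokens[1:]:
        text += tok[2:] if tok.startswith("##") else " " + tok
    return text


def strict_group(tokens, labels):
    """Sketch of grouping that trusts the sub-word labels exactly:
    an 'O' on any sub-word cuts the entity short (assumption)."""
    entities = []
    current_tokens, current_label = [], None
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            if current_tokens:
                entities.append((merge(current_tokens), current_label))
            current_tokens, current_label = [], None
            continue
        label = lab.split("-", 1)[1]
        if lab.startswith("B-") or label != current_label:
            if current_tokens:
                entities.append((merge(current_tokens), current_label))
            current_tokens, current_label = [tok], label
        else:
            current_tokens.append(tok)
    if current_tokens:
        entities.append((merge(current_tokens), current_label))
    return entities


# Only the first sub-word of "Accenture" is labelled, so the entity
# collapses to just "A":
print(strict_group(["A", "##cc", "##ent", "##ure"],
                   ["B-ORG", "O", "O", "O"]))
# [('A', 'ORG')]
```

Run against the Elasticsearch example, the same logic returns the bare sub-word `('##sea', 'MISC')`, matching the behaviour reported above.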
Expected behavior
I would expect the realignment to take the label from the first word part (or the best-scoring sub-word part) and propagate that label to the entire word, never returning sub-words. The default in run_ner.py is to use a padded sub-word label at training, as per the BERT paper, but I've not tried setting that to False yet, as that's not the typical/standard practice.
E.g. Accenture → A ##cc ##ent ##ure → B-ORG O O O → Accenture (ORG)
E.g. Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Mustermann (PER)
E.g. Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → Elasticsearch (MISC)
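The expected behaviour above can be sketched as follows. This is a hypothetical implementation of the "first sub-word wins" strategy, not the pipeline's actual code: sub-words are first merged back into whole words, each word takes the label of its first sub-word, and adjacent words of the same entity are then joined. (The Elasticsearch case, where the first sub-word is O, would additionally need the best-scoring-sub-word variant, which is omitted here.)

```python
def realign_by_word(tokens, labels):
    """Group WordPiece sub-words into full words, labelling each word
    with the label of its first sub-word (assumed strategy), then
    merge consecutive words of the same entity."""
    # 1) Rebuild words; '##' continuations inherit the word's label.
    words = []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and words:
            word, label = words[-1]
            words[-1] = (word + tok[2:], label)
        else:
            words.append((tok, lab))
    # 2) Merge consecutive words belonging to the same entity.
    entities = []
    for word, lab in words:
        if lab == "O":
            continue
        label = lab.split("-", 1)[1]
        if entities and lab.startswith("I-") and entities[-1][1] == label:
            entities[-1] = (entities[-1][0] + " " + word, label)
        else:
            entities.append((word, label))
    return entities


print(realign_by_word(["A", "##cc", "##ent", "##ure"],
                      ["B-ORG", "O", "O", "O"]))
# [('Accenture', 'ORG')]
print(realign_by_word(["Max", "Must", "##erman", "##n"],
                      ["B-PER", "I-PER", "I-PER", "O"]))
# [('Max Mustermann', 'PER')]
```

Because the whole word is always emitted, no bare sub-words like `##sea` can leak into the pipeline's output.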
I'll add that it seems odd that this business logic is in the pipeline. When evaluating on conll03, I assume we score on the sub-words/first word, but this realignment should be considered during evaluation too. As-is, I suspect the recall is lower than it should be.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 9 (9 by maintainers)
Top GitHub Comments
Thanks @elk-cloner for having a look! Happy to contribute by reviewing PRs, etc.
I’ll put this up as a good first issue to see if a member of the community feels like working on it. Thank you for the discussion and for writing all of this up!