NER label re-alignment always expects B-labelled first sub-words
Environment info
- transformers version: 4.3.1
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.7.7
- PyTorch version (GPU?): 1.7.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
- bert, tokenizers, pipelines: @LysandreJik
- trainer, maintained examples: @sgugger
Information
Model I am using (Bert, XLNet …): DistilBERT fine-tuned for conll03
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Fine-tune a BERT model for NER/conll03 using the run_ner.py example script, all default values
- Correct the label alignments, see config.json
- Infer using entities that have not been seen at training time, and are composed of multiple word-parts as defined by WordPiece (my assumption as to the cause).
- Sub-words are labelled, but the pipeline re-grouping/label alignment relies on perfect sub-word labelling:
E.g. Accenture → A ##cc ##ent ##ure → B-ORG O O O → A (ORG)
E.g. Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Musterman (PER)
E.g. Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → ##sea (MISC)
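To make the failure mode concrete, here is a minimal pure-Python sketch of grouping that relies on perfect sub-word labelling, as described above. The function names and logic are illustrative assumptions, not the actual pipeline internals: it keeps only contiguous labelled sub-words, so a word whose later sub-words were labelled O gets truncated.

```python
def merge(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the
    previous piece (hypothetical helper, not the pipeline's)."""
    text = tokens[0]
    for tok in tokens[1:]:
        text += tok[2:] if tok.startswith("##") else " " + tok
    return text


def strict_group(tokens, labels):
    """Sketch of grouping that trusts the sub-word labels exactly:
    an 'O' on any sub-word cuts the entity short (assumption)."""
    entities = []
    current_tokens, current_label = [], None
    for tok, lab in zip(tokens, labels):
        if lab == "O":
            if current_tokens:
                entities.append((merge(current_tokens), current_label))
            current_tokens, current_label = [], None
            continue
        label = lab.split("-", 1)[1]
        if lab.startswith("B-") or label != current_label:
            if current_tokens:
                entities.append((merge(current_tokens), current_label))
            current_tokens, current_label = [tok], label
        else:
            current_tokens.append(tok)
    if current_tokens:
        entities.append((merge(current_tokens), current_label))
    return entities


# Only the first sub-word of "Accenture" is labelled, so the entity
# collapses to just "A":
print(strict_group(["A", "##cc", "##ent", "##ure"],
                   ["B-ORG", "O", "O", "O"]))
# [('A', 'ORG')]
```

Run against the Elasticsearch example, the same logic returns the bare sub-word `('##sea', 'MISC')`, matching the behaviour reported above.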
Expected behavior
I would expect the realignment to take the label from the first word part (or the best-scoring sub-word part) and propagate that label to the entire word, never returning sub-words. The default in run_ner.py is to use a padded sub-word label at training, as per the BERT paper, but I've not tried setting that to False yet, as that's not the typical/standard practice.
E.g. Accenture → A ##cc ##ent ##ure → B-ORG O O O → Accenture (ORG)
E.g. Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Mustermann (PER)
E.g. Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → Elasticsearch (MISC)
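The expected behaviour above can be sketched as follows. This is a hypothetical implementation of the "first sub-word wins" strategy, not the pipeline's actual code: sub-words are first merged back into whole words, each word takes the label of its first sub-word, and adjacent words of the same entity are then joined. (The Elasticsearch case, where the first sub-word is O, would additionally need the best-scoring-sub-word variant, which is omitted here.)

```python
def realign_by_word(tokens, labels):
    """Group WordPiece sub-words into full words, labelling each word
    with the label of its first sub-word (assumed strategy), then
    merge consecutive words of the same entity."""
    # 1) Rebuild words; '##' continuations inherit the word's label.
    words = []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and words:
            word, label = words[-1]
            words[-1] = (word + tok[2:], label)
        else:
            words.append((tok, lab))
    # 2) Merge consecutive words belonging to the same entity.
    entities = []
    for word, lab in words:
        if lab == "O":
            continue
        label = lab.split("-", 1)[1]
        if entities and lab.startswith("I-") and entities[-1][1] == label:
            entities[-1] = (entities[-1][0] + " " + word, label)
        else:
            entities.append((word, label))
    return entities


print(realign_by_word(["A", "##cc", "##ent", "##ure"],
                      ["B-ORG", "O", "O", "O"]))
# [('Accenture', 'ORG')]
print(realign_by_word(["Max", "Must", "##erman", "##n"],
                      ["B-PER", "I-PER", "I-PER", "O"]))
# [('Max Mustermann', 'PER')]
```

Because the whole word is always emitted, no bare sub-words like `##sea` can leak into the pipeline's output.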
I'll add that it seems odd that this business logic is in the pipeline. When evaluating on conll03, I assume we score on the sub-words/first word, but this realignment should be considered during evaluation too. As-is, I suspect the recall is lower than it should be.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 9 (9 by maintainers)
Top GitHub Comments
Thanks @elk-cloner for having a look! Happy to contribute by reviewing PRs, etc.
I’ll put this up as a good first issue to see if a member of the community feels like working on it. Thank you for the discussion and for writing all of this up!