
NER label re-alignment always expects B labelled first sub-words

See original GitHub issue

Environment info

  • transformers version: 4.3.1
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.7.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet …): DistilBERT fine-tuned for conll03

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Fine-tune a BERT model for NER/conll03 using the run_ner.py example script, all default values
  2. Correct the label alignments, see config.json
  3. Infer using entities that have not been seen at training time, and are composed of multiple word-parts as defined by WordPiece (my assumption as to the cause).
  4. Sub-words are labelled but pipeline re-grouping/label alignment relies on perfect sub-word labelling:

  • Accenture → A ##cc ##ent ##ure → B-ORG O O O → A (ORG)
  • Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Musterman (PER)
  • Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → ##sea (MISC)
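The failure mode can be illustrated with a minimal, self-contained sketch (not the actual transformers pipeline code): when entities are grouped directly from per-sub-word labels by merging contiguous non-O sub-words, any word with imperfect sub-word labels comes out truncated or as a bare fragment.

```python
def naive_group(tokens, labels):
    """Merge contiguous non-O sub-words into (text, type) entities.

    Mimics grouping that trusts per-sub-word labels: a word whose
    sub-words are not all labelled is emitted only partially.
    """
    entities, span, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab != "O":
            if not span:
                etype = lab.split("-", 1)[1]
            span.append(tok)
        elif span:
            # join sub-words, stitching "##" continuations back together
            entities.append((" ".join(span).replace(" ##", ""), etype))
            span, etype = [], None
    if span:
        entities.append((" ".join(span).replace(" ##", ""), etype))
    return entities

print(naive_group(["A", "##cc", "##ent", "##ure"],
                  ["B-ORG", "O", "O", "O"]))       # [('A', 'ORG')]
print(naive_group(["El", "##astic", "##sea", "##rch"],
                  ["O", "O", "I-MISC", "O"]))      # [('##sea', 'MISC')]
```

This reproduces all three outputs above, including the bare `##sea` fragment.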

Expected behavior

I would expect the realignment to take the label from the first sub-word (or the best-scoring sub-word) and propagate it to the entire word, never returning bare sub-words. The default in run_ner.py is to use a padded sub-word label at training time, as per the BERT paper; I haven’t tried setting that to False yet, since that isn’t the typical/standard practice.

  • Accenture → A ##cc ##ent ##ure → B-ORG O O O → Accenture (ORG)
  • Max Mustermann → Max Must ##erman ##n → B-PER I-PER I-PER O → Max Mustermann (PER)
  • Elasticsearch → El ##astic ##sea ##rch → O O I-MISC O → Elasticsearch (MISC)
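The proposed realignment can be sketched as a two-step pass, assuming WordPiece’s `##` continuation convention: first merge sub-words back into whole words, then label each word from its first non-O sub-word (a best-scoring variant would slot in at the same point) and group adjacent words of the same entity type.

```python
def realign_first_subword(tokens, labels):
    """Propagate a word's first non-O sub-word label to the whole word."""
    # 1) merge sub-words back into whole words, keeping their labels
    words, word_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            word_labels[-1].append(lab)
        else:
            words.append(tok)
            word_labels.append([lab])
    # 2) each word gets its first non-O sub-word label
    labs = [next((l for l in ls if l != "O"), "O") for ls in word_labels]
    # 3) group adjacent words of the same entity type into spans
    entities, span, etype = [], [], None
    for word, lab in zip(words, labs):
        if lab == "O":
            if span:
                entities.append((" ".join(span), etype))
                span, etype = [], None
            continue
        t = lab.split("-", 1)[1]
        if span and (lab.startswith("B-") or t != etype):
            entities.append((" ".join(span), etype))
            span = []
        span.append(word)
        etype = t
    if span:
        entities.append((" ".join(span), etype))
    return entities

print(realign_first_subword(["El", "##astic", "##sea", "##rch"],
                            ["O", "O", "I-MISC", "O"]))
# [('Elasticsearch', 'MISC')]
```

With this pass, all three examples above come out as whole words with the expected types.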

I’ll add that it seems odd that this business logic lives in the pipeline. When evaluating on conll03, I assume the first sub-word’s prediction is used, but this realignment should also be applied during evaluation. As-is, I suspect the reported recall is lower than it should be.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
joshdevins commented, Mar 3, 2021

Thanks @elk-cloner for having a look! Happy to contribute by reviewing PRs, etc.

1 reaction
LysandreJik commented, Feb 23, 2021

I’ll put this up as a good first issue to see if a member of the community feels like working on it. Thank you for the discussion and for writing all of this up!


