CoNLL2003 ner_tags order mismatch between the dataset from HF and the pretrained model

@patrickvonplaten @dslim23

@dslim23's pretrained models, such as https://huggingface.co/dslim/bert-base-NER, have the following NER tag order baked in: "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"

while the https://huggingface.co/datasets/conll2003 dataset uses: O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6), B-MISC (7), I-MISC (8)

The mismatch produces incorrect accuracy measurements out of the box for the pretrained NER models. To reproduce, run, for instance:

python examples/pytorch/token-classification/run_ner.py --model_name_or_path dslim/bert-base-NER --dataset_name conll2003 --output_dir /tmp/test-ner --do_eval
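
To make the mismatch concrete, here is a minimal Python sketch that uses only the two label orders quoted above. The names MODEL_LABELS, DATASET_LABELS, decode and model_pred_ids are illustrative assumptions, not part of transformers or datasets, and the prediction is made up for the example.

```python
# Label orders as quoted in the issue; the variable names are illustrative.
MODEL_LABELS = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
                "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
DATASET_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                  "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode(ids, labels):
    """Translate integer tag IDs into tag strings under a given label order."""
    return [labels[i] for i in ids]


# IDs that mean the same tag under both orders.
agreeing_ids = [
    i for i, (m, d) in enumerate(zip(MODEL_LABELS, DATASET_LABELS)) if m == d
]

# Hypothetical model output for a three-token span,
# meaning B-PER, I-PER, O in the model's own order.
model_pred_ids = [3, 4, 0]

print("IDs with identical meaning:", agreeing_ids)
print("Read with the model order:  ", decode(model_pred_ids, MODEL_LABELS))
print("Read with the dataset order:", decode(model_pred_ids, DATASET_LABELS))
```

Only ID 0 ("O") means the same thing under both orders. A correct person prediction from the model is therefore read as an organization when its IDs are compared directly against the dataset's IDs, which is why the evaluation numbers come out wrong.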

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction
julien-c commented, Dec 2, 2021

@patrickvonplaten thanks for the ping, though in this case it’s the script that should be able to remap labels, no? The model looks correctly defined with https://huggingface.co/dslim/bert-base-NER/blob/main/config.json
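
One way such a remapping could work is to match labels by name instead of by position. The sketch below is a hedged illustration in plain Python, not what run_ner.py actually does. It assumes the model config exposes an id2label mapping with string keys, as JSON configs typically do, whose order mirrors the one quoted in the issue. build_id_remap and remap_predictions are hypothetical helper names.

```python
from typing import Dict, List, Sequence

# Assumed shape of the model's id2label mapping (string keys, as in JSON),
# filled in from the label order quoted in the issue.
MODEL_ID2LABEL: Dict[str, str] = {
    "0": "O", "1": "B-MISC", "2": "I-MISC", "3": "B-PER", "4": "I-PER",
    "5": "B-ORG", "6": "I-ORG", "7": "B-LOC", "8": "I-LOC",
}

# Dataset label order as quoted in the issue.
DATASET_LABELS: List[str] = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                             "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def build_id_remap(model_id2label: Dict[str, str],
                   dataset_labels: Sequence[str]) -> Dict[int, int]:
    """Map each model tag ID to the dataset tag ID with the same label name."""
    dataset_label2id = {label: i for i, label in enumerate(dataset_labels)}
    missing = set(model_id2label.values()) - set(dataset_label2id)
    if missing:
        # Name-based remapping is only possible when the label sets match.
        raise ValueError(f"Labels unknown to the dataset: {sorted(missing)}")
    return {int(mid): dataset_label2id[label]
            for mid, label in model_id2label.items()}


def remap_predictions(pred_ids: Sequence[int],
                      remap: Dict[int, int]) -> List[int]:
    """Convert model-order tag IDs into dataset-order tag IDs."""
    return [remap[i] for i in pred_ids]


remap = build_id_remap(MODEL_ID2LABEL, DATASET_LABELS)
# Model-order IDs for: O, B-PER, I-PER, O, B-LOC
print(remap_predictions([0, 3, 4, 0, 7], remap))
```

Matching by name keeps the model's config untouched, which fits the observation that the model itself is correctly defined. The conversion happens only where predictions meet the dataset's gold labels.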

0 reactions
github-actions[bot] commented, Jan 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

