CoNLL2003 ner_tags order mismatch between the dataset from HF and the pretrained model

@patrickvonplaten @dslim23

@dslim23's pretrained models, such as https://huggingface.co/dslim/bert-base-NER, have the following NER tag order baked in: "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"

while the https://huggingface.co/datasets/conll2003 dataset has: O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6), B-MISC (7), I-MISC (8)

Because the same integer id refers to a different tag in the model and in the dataset, accuracy measurements for the pretrained NER models are meaningless out of the box. To reproduce, try for instance:

python examples/pytorch/token-classification/run_ner.py --model_name_or_path dslim/bert-base-NER --dataset_name conll2003 --output_dir /tmp/test-ner --do_eval
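To make the mismatch concrete, here is a minimal sketch of remapping dataset tag ids into the model's id space. The two label lists are copied from the orders quoted above; the helper names `build_remap` and `remap_tags` are hypothetical, and the `-100` ignore index is assumed to be the padding/subword label convention used during evaluation.

```python
# Label order baked into the pretrained model (as quoted in this issue).
MODEL_LABELS = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
                "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Label order used by the conll2003 dataset (as quoted in this issue).
DATASET_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                  "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def build_remap(source_labels, target_labels):
    """Return a list where position i holds the target id of source id i."""
    target_index = {name: i for i, name in enumerate(target_labels)}
    return [target_index[name] for name in source_labels]


def remap_tags(tag_ids, remap, ignore_index=-100):
    """Translate a sequence of tag ids, leaving ignored positions untouched.

    ignore_index=-100 is an assumption about how ignored tokens are marked.
    """
    return [t if t == ignore_index else remap[t] for t in tag_ids]


# Hypothetical helper output: dataset id -> model id.
DATASET_TO_MODEL = build_remap(DATASET_LABELS, MODEL_LABELS)

# Ids on which the two orderings agree without any remapping.
AGREEING_IDS = [i for i, j in enumerate(DATASET_TO_MODEL) if i == j]
```

With these two orderings only the "O" tag shares an id, so every entity tag is scored against the wrong class unless the ids are translated first.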

Issue Analytics

State:
Created 2 years ago
Reactions: 1
Comments: 11 (8 by maintainers)

Top GitHub Comments

1 reaction

julien-c commented, Dec 2, 2021

@patrickvonplaten thanks for the ping, though in this case it's the script that should be able to remap labels, no? The model looks correctly defined in https://huggingface.co/dslim/bert-base-NER/blob/main/config.json
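One way a script could do that remapping is to prefer the model's own label mapping whenever the model and the dataset use the same set of label names, and fall back to the dataset order otherwise. The sketch below is an illustration of that idea, not the actual `run_ner.py` code; `resolve_label_to_id` is a hypothetical helper, and `CONFIG_LABEL2ID` is a hand-written stand-in that mirrors the tag order quoted in this issue rather than a value loaded from the real config.

```python
def resolve_label_to_id(config_label2id, dataset_labels):
    """Pick the label -> id mapping an evaluation script should use.

    If the model config and the dataset agree on the set of label names,
    the model's ids win, so dataset tags get encoded in the model's order.
    Otherwise fall back to the dataset's own ordering.
    """
    if set(config_label2id) == set(dataset_labels):
        return {name: int(config_label2id[name]) for name in dataset_labels}
    return {name: i for i, name in enumerate(dataset_labels)}


# Assumption: stand-in for the model config's label2id, built from the
# tag order quoted in the issue text.
CONFIG_LABEL2ID = {
    "O": 0, "B-MISC": 1, "I-MISC": 2, "B-PER": 3, "I-PER": 4,
    "B-ORG": 5, "I-ORG": 6, "B-LOC": 7, "I-LOC": 8,
}

# Dataset label names in the dataset's own order (from the issue text).
DATASET_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                  "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

label_to_id = resolve_label_to_id(CONFIG_LABEL2ID, DATASET_LABELS)

# Encode a dataset example given as tag names into model-compatible ids.
example_tags = ["B-PER", "I-PER", "O", "B-LOC"]
encoded = [label_to_id[tag] for tag in example_tags]
```

Comparing label names rather than integer ids keeps the check independent of either ordering, and the fallback leaves behavior unchanged for models whose label set differs from the dataset's.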
