[DOC] Fine-Tuning NER Custom Dataset Clarification
I'm following this guide for fine-tuning for NER with a custom dataset. I struggled with the example code for `def encode_tags()` until I realized that the number of tokens per sample is limited to 512 and my dataset exceeded this in some instances. This resulted in errors like:

`ValueError: NumPy boolean array indexing assignment cannot assign 544 input values to the 464 output values where the mask is true`

I currently assume the limit comes from the specific tokenizer. I'm using `tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')`, as in the example.
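For context, here is a minimal sketch of how the mismatch can arise. The toy `texts`/`tags`/`tag2id` data is hypothetical, and `encode_tags` follows the shape of the helper in the guide (not a verbatim copy). With `truncation=True`, each sample is capped at 512 sub-tokens, so for a long sample the boolean mask over the offset mapping selects fewer positions than there are word-level labels:

```python
import numpy as np
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# hypothetical toy data: pre-split sentences with parallel word-level tags
texts = [["John", "lives", "in", "Berlin"]]
tags = [["B-PER", "O", "O", "B-LOC"]]
tag2id = {"O": 0, "B-PER": 1, "B-LOC": 2}

encodings = tokenizer(texts, is_split_into_words=True,
                      return_offsets_mapping=True,
                      padding=True, truncation=True)  # truncation caps each sample at 512 sub-tokens

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings["offset_mapping"]):
        # start with -100 everywhere (ignored by the loss)
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # word-start sub-tokens have offset (0, n) with n != 0; only those get a real label.
        # If a long sample was truncated, this mask has fewer True positions than len(doc_labels),
        # which raises "cannot assign N input values to the M output values where the mask is true".
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels

labels = encode_tags(tags, encodings)
```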
I'm proposing to add a clarification about the per-sample token limit, like this (https://github.com/huggingface/transformers/edit/master/docs/source/custom_datasets.rst, line 365 and following):
Let's write a function to do this. This is where we will use the `offset_mapping` from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we're at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].

And append: "Be aware that this example has an upper limit of 512 tokens per sample."
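To make the wording about the offsets concrete, here is a small illustration (my own, not from the guide) of what the offsets look like with `is_split_into_words=True`, where each offset is relative to the word it came from:

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# a single pre-split sentence; "Johanson" may be split into several sub-tokens
enc = tokenizer(["Johanson", "lives", "here"], is_split_into_words=True,
                return_offsets_mapping=True)

for tok, (start, end) in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]),
                             enc["offset_mapping"]):
    # start != 0 -> continuation sub-token of a word, label should become -100
    # end == 0   -> special token such as [CLS]/[SEP] (or [PAD]), label should become -100
    print(f"{tok:>10}  ({start}, {end})")
```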
Let me know your thoughts, and I'll open a PR if you find this useful.
Update: Even when I ensure the number of tokens per sample is <= 512, I get ValueErrors from calling `encode_tags` on some samples. I'll try to understand this better or provide a demo.

@jorahn Thanks for letting us know about the `tokenize_and_align_labels()` function. But when I follow the method mentioned in the notebook, I get an error from the data collator: `AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'`.