
[DOC] Fine-Tuning NER Custom Dataset Clarification


I’m following this guide for fine-tuning for NER with a custom dataset. I struggled with the example code for def encode_tags() until I realized that the number of tokens per sample is limited to 512, and my dataset exceeded this in some instances. This resulted in errors like: ValueError: NumPy boolean array indexing assignment cannot assign 544 input values to the 464 output values where the mask is true.
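For anyone hitting the same error, the encode_tags() helper from the guide looks roughly like this (tag2id is the label-to-id mapping built earlier in the guide; the comment marks where the ValueError originates when a sample is cut short):

import numpy as np

def encode_tags(tags, encodings):
    # tag2id maps label strings to ids; built earlier in the guide
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # start with -100 everywhere so these positions are ignored by the loss
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)

        # label only the first sub-token of each word: offset starts at 0 and
        # ends past 0. If truncation cut the sequence, this mask has fewer True
        # entries than len(doc_labels), raising exactly the ValueError above.
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels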

I currently assume the limit is due to the specific tokenizer. I’m using tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased'), as in the example.
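For what it’s worth, the limit is recorded on the tokenizer as model_max_length (512 for this checkpoint, matching DistilBERT’s position embeddings). A minimal check, using a toy pre-split sample as a stand-in for the guide’s texts list:

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
print(tokenizer.model_max_length)  # 512 for this checkpoint

# Toy pre-split sample standing in for the guide's texts; truncation=True caps
# each sample at model_max_length, so offsets and labels stay in range
texts = [["John", "lives", "in", "Berlin"]]
encodings = tokenizer(texts, is_split_into_words=True,
                      return_offsets_mapping=True,
                      padding=True, truncation=True)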

I propose adding a clarification about the per-sample token limit, like this:

https://github.com/huggingface/transformers/edit/master/docs/source/custom_datasets.rst Line 365 and following:

Let’s write a function to do this. This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].

And append: Be aware that this example has an upper limit of 512 tokens per sample.
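To make that masking rule concrete, here is a tiny illustration with hypothetical offsets (not part of the guide):

import numpy as np

# Hypothetical offsets for: [CLS], 'fine' (word start), '##tuning' (continuation), [SEP]
offsets = np.array([(0, 0), (0, 4), (4, 10), (0, 0)])

keep = (offsets[:, 0] == 0) & (offsets[:, 1] != 0)
print(keep)  # [False  True False False] -> only 'fine' keeps a real label; the rest get -100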

Let me know your thoughts and I’ll open a PR if you find this useful.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

1 reaction
jorahn commented, May 11, 2021

Update: Even when I ensure the number of tokens per sample is <= 512, I get ValueErrors from calling encode_tags on some samples. I’ll try to understand this better or provide a demo.
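A plausible cause (an assumption, not confirmed in the thread): the 512 limit applies to sub-tokens after wordpiece splitting, not to whitespace words, so a sample with well under 512 words can still exceed the limit once rare words are split into several pieces. A quick way to check:

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# 400 words, but each rare word splits into several wordpieces, so the
# sub-token count (exact number depends on the vocab) can far exceed 512
words = ["antidisestablishmentarianism"] * 400
enc = tokenizer(words, is_split_into_words=True)
print(len(words), len(enc.input_ids))  # word count vs. sub-token count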

0 reactions
Akshay0799 commented, Aug 20, 2022

@jorahn Thanks for letting us know about the tokenize_and_align_labels() function. But when I follow the method mentioned in the notebook, I get an error from the data collator: AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'
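For context, the tokenize_and_align_labels() pattern from the token-classification notebook looks roughly like the sketch below; it aligns labels via word_ids() instead of offset mappings. The AttributeError above usually means raw tokenizers.Encoding objects reached the collator instead of plain dicts, which mapping over a datasets.Dataset with batched=True avoids (an assumption based on the error text, not something confirmed in the thread):

from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:            # special tokens ([CLS], [SEP])
                label_ids.append(-100)
            elif word_idx != previous:      # first sub-token of a word
                label_ids.append(word_labels[word_idx])
            else:                           # continuation sub-token
                label_ids.append(-100)
            previous = word_idx
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

# dataset is assumed to be a datasets.Dataset with "tokens" and "ner_tags" columns:
# tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
# collator = DataCollatorForTokenClassification(tokenizer)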


Top Results From Across the Web

Tutorial: Fine-tuning with custom datasets – sentiment, NER ...
This tutorial demonstrates one workflow for working with custom datasets, but there are many valid ways to accomplish the same thing. The ...
Fine-Tuning Hugging Face Model with Custom Dataset
End-to-end example to explain how to fine-tune the Hugging Face model with a custom dataset using TensorFlow and Keras.
Tutorial: How to Fine-tune BERT for NER - Skim AI
Introduction. This article is on how to fine-tune BERT for Named Entity Recognition (NER). Specifically, how to train a BERT variation, SpanBERTa, for...
️ Label your data to fine-tune a classifier with Hugging Face
This tutorial will show you how to fine-tune a sentiment classifier for your own domain, starting with no labeled data. Most online tutorials...
How to Fine-Tune BERT for NER Using HuggingFace
How to Preprocess the Dataset: import AutoTokenizer; tokenizer ... # Get the values for input_ids, token_type_ids, attention_mask ...
