[DOC] Fine-Tuning NER Custom Dataset Clarification
I'm following this guide for fine-tuning for NER with a custom dataset. I struggled with the example code for `def encode_tags()` until I realized that the number of tokens per sample is limited to 512 and my dataset exceeded this in some instances. This resulted in errors like:

`ValueError: NumPy boolean array indexing assignment cannot assign 544 input values to the 464 output values where the mask is true`

I currently assume the limit comes from the specific tokenizer. I'm using `tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')`, as in the example.
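For context, here is a minimal sketch of how the mismatch can arise. The toy `texts`/`tags`/`tag2id` data is hypothetical, and `encode_tags` follows the shape of the helper in the guide (not a verbatim copy). With `truncation=True`, each sample is capped at 512 sub-tokens, so for a long sample the boolean mask over the offset mapping selects fewer positions than there are word-level labels:

```python
import numpy as np
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# hypothetical toy data: pre-split sentences with parallel word-level tags
texts = [["John", "lives", "in", "Berlin"]]
tags = [["B-PER", "O", "O", "B-LOC"]]
tag2id = {"O": 0, "B-PER": 1, "B-LOC": 2}

encodings = tokenizer(texts, is_split_into_words=True,
                      return_offsets_mapping=True,
                      padding=True, truncation=True)  # truncation caps each sample at 512 sub-tokens

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings["offset_mapping"]):
        # start with -100 everywhere (ignored by the loss)
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # word-start sub-tokens have offset (0, n) with n != 0; only those get a real label.
        # If a long sample was truncated, this mask has fewer True positions than len(doc_labels),
        # which raises "cannot assign N input values to the M output values where the mask is true".
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels

labels = encode_tags(tags, encodings)
```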
I'm proposing to add a clarification about the per-sample token limit, like this (https://github.com/huggingface/transformers/edit/master/docs/source/custom_datasets.rst, line 365 and following):
Let's write a function to do this. This is where we will use the `offset_mapping` from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we're at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].

And append: "Be aware that this example has an upper limit of 512 tokens per sample."
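To make the wording about the offsets concrete, here is a small illustration (my own, not from the guide) of what the offsets look like with `is_split_into_words=True`, where each offset is relative to the word it came from:

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# a single pre-split sentence; "Johanson" may be split into several sub-tokens
enc = tokenizer(["Johanson", "lives", "here"], is_split_into_words=True,
                return_offsets_mapping=True)

for tok, (start, end) in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]),
                             enc["offset_mapping"]):
    # start != 0 -> continuation sub-token of a word, label should become -100
    # end == 0   -> special token such as [CLS]/[SEP] (or [PAD]), label should become -100
    print(f"{tok:>10}  ({start}, {end})")
```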
Let me know your thoughts, and I'll open a PR if you find this useful.
Update: Even when I ensure the number of tokens per sample is <= 512, I get ValueErrors from calling `encode_tags` on some samples. I'll try to understand this better or provide a demo.

@jorahn Thanks for letting us know about the `tokenize_and_align_labels()` function. But when I follow the method mentioned in the notebook, I get an error from the data collator: `AttributeError: 'tokenizers.Encoding' object has no attribute 'keys'`.