A bug in the padding of input examples in the NER fine-tuning example
See original GitHub issue
🐛 Bug
Information
Model I am using (Bert, XLNet …): Roberta
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- TODO
Expected behavior
https://github.com/huggingface/transformers/blob/c59b1e682d6ebaf7295c63418d4570228904e690/examples/ner/utils_ner.py#L123
This line is supposed to return 3 for Roberta models, but it is returning 2, causing the length of the input_ids to exceed max_seq_len. This might be the reason: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_roberta.py#L288
TODO: Share the notebook.
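To make the failure mode concrete, here is a rough, self-contained sketch of the mechanism as I understand it; it is not the exact utils_ner.py code, and the max_seq_len value and token text below are made up for illustration. The example script truncates the word-piece tokens to max_seq_len minus the reported special-token count, then adds the RoBERTa special tokens (including the extra separator), so an undercount of 1 leaves the final sequence one token too long.

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
max_seq_len = 16
tokens = tokenizer.tokenize("word " * 50)  # deliberately longer than max_seq_len

for special_tokens_count in (2, 3):  # 2 = what the issue observes, 3 = what the script needs for RoBERTa
    truncated = tokens[: max_seq_len - special_tokens_count]
    # mimic the conversion: prepend <s>, append </s> plus the extra </s> used for RoBERTa
    with_special = [tokenizer.cls_token] + truncated + [tokenizer.sep_token, tokenizer.sep_token]
    input_ids = tokenizer.convert_tokens_to_ids(with_special)
    print(special_tokens_count, len(input_ids))  # 2 -> 17 (exceeds max_seq_len), 3 -> 16
```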
Environment info
- transformers version: 2.8.0
- Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): 2.2.0-rc2 (True)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Issue Analytics
- State:
- Created 3 years ago
- Comments: 9 (2 by maintainers)
Top Results From Across the Web

Fine-tuning with custom datasets - Hugging Face
We'll pass truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated...
Read more >

How to Fine-Tune BERT for NER Using HuggingFace
How to Pad the Samples. Another issue is different samples can get tokenized into different lengths, so we need to add pad tokens...
Read more >

BERT Fine-Tuning Tutorial with PyTorch - Chris McCormick
In this tutorial I'll show you how to use BERT with the huggingface ... Side Note: The input format to BERT seems "over-specified"...
Read more >

Make The Most of Your Small NER Data Set by Fine-tuning Bert
We first generate a mask for the padding tokens, then we feed the input to the BERT model. We extract the last hidden...
Read more >

Padding for NLP. Why and what ? | by Caner - Medium
padding='post': add the zeros at the end of the sequence to make the samples the same size · maxlen=8: this...
Read more >
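For reference, a minimal sketch of the padding/truncation behaviour the snippets above describe. It assumes a recent transformers release whose tokenizer call accepts padding= and truncation= (releases around 2.8 used pad_to_max_length in encode_plus instead), and the example sentences are made up.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch = [
    "A short sentence.",
    "A noticeably longer sentence that will need more sub-word tokens.",
]

# padding=True pads to the longest sequence in the batch; truncation caps it at max_length
encoded = tokenizer(batch, padding=True, truncation=True, max_length=16)
for ids in encoded["input_ids"]:
    print(len(ids))  # both rows come out the same length
```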
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@TarasPriadka, @AMR-KELEG
I had a similar issue using preprocess.py on an NER dataset. I think the PyPI package hasn't been updated, so pip install transformers won't have the files you need. I built from source and the errors went away. If you try building from source, I think your problem might go away too.

I had an issue with running the NER model. In this commit https://github.com/huggingface/transformers/commit/96ab75b8dd48a9384a74ba4307a4ebfb197343cd, num_added_tokens got changed into num_special_tokens_to_add. Just changing the name of the variable in utils_ner.py fixed the issue for me. However, I had an issue with the variable name not being found. Let me know if this fixes your problem.
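For anyone caught between an old installed release and the renamed method, here is a minimal sketch of the workaround described in the comment above. The hasattr fallback is my own addition, not part of the upstream fix, and it assumes transformers around v2.8, where num_added_tokens was renamed to num_special_tokens_to_add.

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Newer releases expose num_special_tokens_to_add(); older ones used num_added_tokens().
if hasattr(tokenizer, "num_special_tokens_to_add"):
    special_tokens_count = tokenizer.num_special_tokens_to_add()
else:
    special_tokens_count = tokenizer.num_added_tokens()

print(special_tokens_count)  # 2 for a single RoBERTa sequence (<s> ... </s>)
```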