
A bug in the padding of input examples in the NER fine-tuning example

See original GitHub issue

šŸ› Bug

Information

Model I am using (Bert, XLNet …): Roberta

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. TODO

Expected behavior

https://github.com/huggingface/transformers/blob/c59b1e682d6ebaf7295c63418d4570228904e690/examples/ner/utils_ner.py#L123 This line is supposed to return 3 for Roberta models, but it is only returning 2, which lets the length of the input_ids exceed max_seq_len. This might be the reason: https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_roberta.py#L288 TODO: Share the notebook.
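For context, a simplified sketch of the pattern the example follows and why an undercounted special-token budget lets input_ids overshoot; it assumes a transformers install recent enough to expose num_special_tokens_to_add, and the sentence and max_seq_len are made up:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
max_seq_len = 8
words = ["Alice", "visited", "the", "Eiffel", "Tower", "in", "Paris", "today"]

# Condensed mirror of the utils_ner.py logic linked above: budget
# max_seq_len - special_tokens_count word pieces, then assemble
# <s> ... </s> </s> by hand (the example adds an extra separator for Roberta).
tokens = [t for word in words for t in tokenizer.tokenize(word)]
special_tokens_count = tokenizer.num_special_tokens_to_add()
if len(tokens) > max_seq_len - special_tokens_count:
    tokens = tokens[: max_seq_len - special_tokens_count]

tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token, tokenizer.sep_token]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(len(input_ids), max_seq_len)  # one too long whenever the method returns 2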

Environment info

  • transformers version: 2.8.0
  • Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): 2.2.0-rc2 (True)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
nbroad1881 commented, Apr 29, 2020

@TarasPriadka, @AMR-KELEG

I had a similar issue using preprocess.py on an NER dataset.

Traceback (most recent call last):
  File "preprocess.py", line 12, in <module>
    max_len -= tokenizer.num_special_tokens_to_add()
AttributeError: 'BertTokenizer' object has no attribute 'num_special_tokens_to_add'

I think the PyPI package hasn't been updated, so pip install transformers won't have the files you need. I built from source and the errors went away. If you try building from source, I think your problem might go away too.
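For anyone hitting the same AttributeError, a quick sanity check (just a sketch; the model name is arbitrary) before running preprocess.py:

# Confirm the installed transformers exposes the method preprocess.py calls.
# If this prints False, install from source, for example:
#   pip install git+https://github.com/huggingface/transformers.git
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(hasattr(tokenizer, "num_special_tokens_to_add"))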

1 reaction
TarasPriadka commented, Apr 17, 2020

I had an issue running the NER model. In this commit https://github.com/huggingface/transformers/commit/96ab75b8dd48a9384a74ba4307a4ebfb197343cd num_added_tokens got changed into num_special_tokens_to_add, and my error was the old name not being found. Just changing the name of the call in utils_ner.py fixed the issue for me. Let me know if this fixes your problem.
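As a hedged sketch of that rename (the surrounding utils_ner.py code is omitted), a version-tolerant variant simply calls whichever name the installed tokenizer provides:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Use the new name where available, otherwise fall back to the pre-rename one.
if hasattr(tokenizer, "num_special_tokens_to_add"):
    special_tokens_count = tokenizer.num_special_tokens_to_add()
else:
    special_tokens_count = tokenizer.num_added_tokens()
print(special_tokens_count)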

Read more comments on GitHub >

Top Results From Across the Web

Fine-tuning with custom datasets - Hugging Face
We'll pass truncation=True and padding=True, which will ensure that all of our sequences are padded to the same length and are truncated...
Read more >
How to Fine-Tune BERT for NER Using HuggingFace
How to Pad the Samples. Another issue is that different samples can get tokenized into different lengths, so we need to add pad tokens...
Read more >
BERT Fine-Tuning Tutorial with PyTorch - Chris McCormick
In this tutorial I'll show you how to use BERT with the huggingface ... Side Note: The input format to BERT seems "over-specified"...
Read more >
Make The Most of Your Small NER Data Set by Fine-tuning Bert
We first generate a mask for the padding tokens, then we feed the input to the BERT model. We extract the last hidden...
Read more >
Padding for NLP. Why and what ? | by Caner - Medium
padding="post": add the zeros at the end of the sequence to make the samples the same size · maxlen=8: this...
Read more >
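
The results above all land on the same point: pad and truncate every sequence to one fixed length so input_ids never exceeds the model limit. A minimal sketch, assuming a transformers release recent enough to accept padding and truncation flags in the batched tokenizer call (the sentences are placeholders):

# Pad the short example and truncate the long one so both come back at the
# same length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch = tokenizer(
    ["Paris is nice", "A much longer sentence that would otherwise overflow the limit"],
    padding=True,
    truncation=True,
    max_length=12,
)
print([len(ids) for ids in batch["input_ids"]])  # equal lengths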
