Fine-tuning GPT: problems with padding
Environment info
- transformers version: 3.4.0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.3
- PyTorch version (GPU?): 1.5.1+cu101 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
@LysandreJik tokenizers: @mfuntowicz
Information
Model I am using: openai-gpt
The problem arises when using:
- my own modified scripts: the scripts are my own, inspired by the GLUE examples.
The task I am working on is:
- my own task or dataset: simple binary text classification, nothing fancy, inspired by the GLUE example files.
To reproduce
Steps to reproduce the behavior:
As reported in other issues, the GPT* tokenizers do not come with a padding token out of the box. One workaround is to set the padding token to the eos token. This works fine for the GPT2 models (I tried GPT2 and DistilGPT2), but causes problems for the original GPT model. Comparing the two, the GPT2 config files contain ids for the bos and eos tokens, while these are missing from the GPT config file (not sure this is the real problem). Some other interesting bits from the output:
- ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.
- Using eos_token, but it is not set yet.
Bottom line, it crashes with:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})
despite the fact that I have tokenizer.pad_token = tokenizer.eos_token in the code.
I suspect an issue with the tokenizer/missing ids for the special tokens, and I wonder whether something is missing from the model's config file.
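For reference, here is a stripped-down sketch of what triggers this for me (not my actual training script; the checkpoint names and inputs are just examples):

```python
from transformers import GPT2Tokenizer, OpenAIGPTTokenizer

texts = ["a short example", "a somewhat longer example sentence"]

# GPT2 / DistilGPT2: reusing the eos token as pad token works.
gpt2_tok = GPT2Tokenizer.from_pretrained("distilgpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token
print(gpt2_tok(texts, padding=True)["input_ids"])  # padded as expected

# openai-gpt: the tokenizer has no eos token to begin with, so the same
# workaround just sets pad_token to None...
gpt_tok = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
gpt_tok.pad_token = gpt_tok.eos_token  # logs "Using eos_token, but it is not set yet."
# ...and padding then raises:
# ValueError: Asking to pad but the tokenizer does not have a padding token.
gpt_tok(texts, padding=True)
```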
Expected behavior
No error? 😃 I don’t see any of these issues after setting the padding token to the eos token for the GPT2 model. As I briefly mentioned above, the only difference that I see in the config file is the ids for the eos/bos tokens, which seem to be missing from the GPT model config.
Thanks for your help!
To get GPT2 to work, you’ll also need to update the config’s pad token to be the eos token:
config.pad_token_id = config.eos_token_id
For example, in examples/lightning_base.py, I've added lines to that effect right after loading the tokenizer in BaseTransformer().__init__():
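Something along these lines (a minimal sketch; in lightning_base.py the tokenizer and config live on self, and the checkpoint name here is just an example):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "gpt2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# Make the tokenizer and the model config agree on a pad token by reusing eos:
# the tokenizer pads batches with its own pad_token, while the model only
# ever sees config.pad_token_id.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
config.pad_token_id = config.eos_token_id
```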
Indeed, the root of the issue seems to be that you're asking your tokenizer to pad the sequences, but it does not have a padding token, and therefore cannot do so.
If setting the tokenizer's pad token to the eos token doesn't work, you can try adding a new token to the tokenizer with the add_special_tokens() method and then resizing the model embedding layer. Seeing as you should use the attention mask when padding, these tokens should have close to zero influence on your training.
See the docs about the aforementioned methods here
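For openai-gpt specifically, which defines no eos token at all, that second option would look roughly like this (a sketch; the checkpoint name and inputs are examples, and the plain OpenAIGPTModel stands in for whatever head you fine-tune):

```python
from transformers import OpenAIGPTModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTModel.from_pretrained("openai-gpt")

# Add a dedicated pad token, since openai-gpt ships without special tokens...
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# ...and resize the embedding matrix so the new token id gets an embedding row.
model.resize_token_embeddings(len(tokenizer))

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)
# Passing the attention mask keeps the pad positions from influencing the model.
outputs = model(**batch)
```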