Fine-tuning GPT: problems with padding
Environment info
- transformers version: 3.4.0
- Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.3
- PyTorch version (GPU?): 1.5.1+cu101 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
@LysandreJik tokenizers: @mfuntowicz
Information
Model I am using: openai-gpt
The problem arises when using:
- my own modified scripts: the scripts are my own, inspired by the GLUE examples.
The task I am working on is:
- my own task or dataset: simple binary text classification, nothing fancy, inspired by the GLUE example files.
To reproduce
Steps to reproduce the behavior:
As reported in other issues, the GPT* tokenizers do not come with a padding token out of the box. One workaround is to set the padding token to the eos token. This works fine for the GPT2 models (I tried GPT2 and DistilGPT2), but causes problems for the original GPT model. Comparing the two, the GPT2 config files contain ids for the bos and eos tokens, while these are missing from the GPT config file (not sure this is the real problem). Some other interesting bits from the output:
- ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.
- Using eos_token, but it is not set yet.
Bottom line, it crashes with:
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})
despite the fact that I have tokenizer.pad_token = tokenizer.eos_token in the code.
I suspect an issue with the tokenizer/missing ids for the special tokens, and I wonder whether something is missing from the model's config file.
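For reference, here is a stripped-down sketch of what triggers this for me (not my actual training script; the checkpoint names and inputs are just examples):

```python
from transformers import GPT2Tokenizer, OpenAIGPTTokenizer

texts = ["a short example", "a somewhat longer example sentence"]

# GPT2 / DistilGPT2: reusing the eos token as pad token works.
gpt2_tok = GPT2Tokenizer.from_pretrained("distilgpt2")
gpt2_tok.pad_token = gpt2_tok.eos_token
print(gpt2_tok(texts, padding=True)["input_ids"])  # padded as expected

# openai-gpt: the tokenizer has no eos token to begin with, so the same
# workaround just sets pad_token to None...
gpt_tok = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
gpt_tok.pad_token = gpt_tok.eos_token  # logs "Using eos_token, but it is not set yet."
# ...and padding then raises:
# ValueError: Asking to pad but the tokenizer does not have a padding token.
gpt_tok(texts, padding=True)
```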
Expected behavior
No error? 😃 I don’t see any of these issues after setting the padding token to the eos token for the GPT2 model. As I briefly mentioned above, the only difference that I see in the config file is the ids for the eos/bos tokens, which seem to be missing from the GPT model config.
Thanks for your help!
To get GPT2 to work, you’ll also need to update the config’s pad token to be the eos token:
config.pad_token_id = config.eos_token_id
For example, in examples/lightning_base.py, I've added lines to that effect right after loading the tokenizer in BaseTransformer().__init__():
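Something along these lines (a minimal sketch; in lightning_base.py the tokenizer and config live on self, and the checkpoint name here is just an example):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "gpt2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# Make the tokenizer and the model config agree on a pad token by reusing eos:
# the tokenizer pads batches with its own pad_token, while the model only
# ever sees config.pad_token_id.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
config.pad_token_id = config.eos_token_id
```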
Indeed, the root of the issue seems to be that you're asking your tokenizer to pad the sequences, but it does not have a padding token, and therefore cannot do so.
If setting the tokenizer's pad token to the eos token doesn't work, you can try adding a new token to the tokenizer with the add_special_tokens() method and then resizing the model embedding layer. Seeing as you should use the attention mask when padding, these tokens should have close to zero influence on your training.
See the docs about the aforementioned methods here
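For openai-gpt specifically, which defines no eos token at all, that second option would look roughly like this (a sketch; the checkpoint name and inputs are examples, and the plain OpenAIGPTModel stands in for whatever head you fine-tune):

```python
from transformers import OpenAIGPTModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTModel.from_pretrained("openai-gpt")

# Add a dedicated pad token, since openai-gpt ships without special tokens...
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# ...and resize the embedding matrix so the new token id gets an embedding row.
model.resize_token_embeddings(len(tokenizer))

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)
# Passing the attention mask keeps the pad positions from influencing the model.
outputs = model(**batch)
```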