Pad token for GPT2 and OpenAIGPT models
See original GitHub issue
❓ Questions & Help
I noticed that, out of all the models, pad_token is not set only for OpenAIGPTModel and GPT2Model. I get the warning "Using pad_token, but it is not set yet." and pad_token_id is None.
Is there any specific reason why that is so?
If not, what is the appropriate padding token to be used for these models?
Thanks
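For context, a minimal sketch of how this shows up (assuming a recent transformers release; exact logging behaviour may differ across versions):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 ships without a dedicated padding token, so both attributes are unset.
print(tokenizer.pad_token)     # None
print(tokenizer.pad_token_id)  # None; accessing it is what triggers the
                               # "Using pad_token, but it is not set yet." warning
```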
Because GPT2 and GPT are causal LMs, you don’t need to pad shorter sentences in batches. It is important, though, that the loss on these “unnecessary” tokens is not calculated. You should set all labels corresponding to “PADDED” tokens to -100. In the code snippet you can see, in the map_to_encoder_decoder_inputs function, how the labels are set to -100 for attention_mask = 0: https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16#training-script
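A minimal sketch of that idea (my own simplified version, not the linked map_to_encoder_decoder_inputs function; reusing the EOS token as the pad token is a common convention rather than something required by the model):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no dedicated pad token; reusing EOS for padding is a common choice.
tokenizer.pad_token = tokenizer.eos_token

texts = ["a short example", "a noticeably longer example sentence"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

# For causal LM training the labels are the input ids, but every position that
# corresponds to padding (attention_mask == 0) is set to -100 so that the
# cross-entropy loss ignores it.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100
```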