
Pad token for GPT2 and OpenAIGPT models

See original GitHub issue

❓ Questions & Help

I noticed that, out of all the models, pad_token is not set only for OpenAIGPTModel and GPT2Model. I get the warning Using pad_token, but it is not set yet. and pad_token_id is None.

Is there any specific reason why that is?

If not, what is the appropriate padding token to be used for these models?

Thanks
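
For context, a minimal sketch (assuming the transformers library) of the situation described in the question; the pretrained GPT-2 tokenizer simply ships without a pad token:

```python
# Minimal sketch: the pretrained GPT-2 tokenizer has no pad token by default.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.pad_token)     # None
print(tokenizer.pad_token_id)  # None
```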

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 6
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

8 reactions
Choitsugun commented on Dec 1, 2021

“Because GPT2 and GPT are causal LM you don’t need to pad shorter sentences in batches.” Why? Padding only exists to make the sequences in a batch the same length. Does this have anything to do with GPT-2 being a causal model?

6 reactions
patrickvonplaten commented on Sep 1, 2020

Because GPT2 and GPT are causal LMs, you don’t need to pad shorter sentences in batches. It is important, though, that the loss on these “unnecessary” tokens is not calculated. You should set all labels corresponding to “PADDED” tokens to -100. In the map_to_encoder_decoder_inputs function of the linked code snippet you can see how the labels are set to -100 wherever attention_mask = 0: https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16#training-script
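
As a rough illustration of this advice, here is a minimal sketch, not the linked training script, assuming the transformers library, GPT2LMHeadModel, and the common workaround of reusing the EOS token as the pad token (also mentioned in the results below):

```python
# Minimal sketch: reuse EOS as the pad token, then set labels to -100 on
# padded positions so the language-modeling loss ignores them.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(
    ["a short sentence", "a somewhat longer example sentence"],
    padding=True,
    return_tensors="pt",
)

labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # no loss on padded tokens

outputs = model(**batch, labels=labels)
print(outputs.loss)
```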

Read more comments on GitHub >

Top Results From Across the Web

OpenAI GPT2
GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left....
Read more >
CLRP📚: GPT2 implementation
OpenAI GPT-2 model was proposed in Language Models are Unsupervised ... Define PAD Token = EOS Token = 50256 tokenizer.pad_token = tokenizer.eos_token.
Read more >
Asking to pad but the tokenizer does not have a padding ...
When tokenizing the sentences of my dataset. The tokenizing code is. SEQ_LEN = MAX_LEN #(50) for pretrained_weights, model_name in MODELS: print ...
Read more >
Text generation with GPT-2
Tokenizer: they store the vocabulary of each model and include methods to encode and decode strings in a list of token embeddings indexes...
Read more >
OpenAI GPT2 — adapter-transformers documentation
OpenAI GPT-2 model was proposed in Language Models are Unsupervised ... Since this class does classification on the last token, it requires to...
Read more >
