GPT model `generate()` function not correctly skipping the padding tokens indicated by `attention_mask`
According to #7552, padding tokens are skipped when calculating the position ids during generate(), as long as the corresponding positions are masked out in attention_mask (a sketch of that logic follows the reproduction code below). If I understand this correctly, this would mean that the presence of padding tokens does not matter as long as they are not attended to. However, I found that this is not exactly the case. Am I missing something here?
Check the following code for reproduction:
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
# note that input_str_1 and input_str_2 only differ in the number & position of eos tokens
input_str_1 = "# in a kilometer race , a beats b by 48 meters or 12 seconds . what time does a take to complete the race ? n0 = 48.0 n1 = 12.0\nleg = n0 / n1\n<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>"
input_str_2 = "# in a kilometer race , a beats b by 48 meters or 12 seconds . what time does a take to complete the race ? n0 = 48.0 n1 = 12.0\n<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>leg = n0 / n1\n"
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token
gradient_ckpt = True
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", pad_token_id=tokenizer.eos_token_id, gradient_checkpointing=gradient_ckpt, use_cache=not gradient_ckpt)
def test_generate(input_str: str):
    input_ids = tokenizer.encode(input_str, add_special_tokens=False, return_tensors="pt")
    attention_mask = torch.where(input_ids == tokenizer.eos_token_id, torch.zeros_like(input_ids), torch.ones_like(input_ids)).to(model.device)
    output_ids = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=30, num_return_sequences=1)
    output_str = tokenizer.decode(output_ids[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
    print(f"##################\n{output_str}\n##################")
test_generate(input_str_1)
test_generate(input_str_2)
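For reference, the position-id handling from #7552 that the question relies on works roughly as follows: when an attention_mask is passed and no position_ids are given, generate() builds the position ids from a cumulative sum over the mask, so padding positions do not advance the position counter. A minimal sketch of that logic (it mirrors prepare_inputs_for_generation in the GPT2/GPT-Neo models; exact details may differ across transformers versions):

import torch

def position_ids_from_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # Real tokens (mask == 1) advance the position counter, padding tokens do not.
    position_ids = attention_mask.long().cumsum(-1) - 1
    # Padding positions get a dummy position id; they are masked out anyway.
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids

# Hypothetical mask with two padding tokens in the middle:
mask = torch.tensor([[1, 1, 1, 0, 0, 1, 1]])
print(position_ids_from_attention_mask(mask))  # tensor([[0, 1, 2, 1, 1, 3, 4]])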
Hey @niansong1996,

I think your understanding is very much correct here. If I understand your example correctly, you are seeing (very) small differences in the output logits that shouldn't be there. I'm quite sure that this is because masked tokens are not perfectly masked but just have a large negative number (-10,000) added to them, to avoid any issues with float16. Now this is amplified in GPT2 for two reasons:

1. GPT2 uses a causal mask with -10,000 by default, and then if the token is also masked by the padding mask it adds -10,000 again instead of replacing the score with just -10,000. E.g. see these lines: https://github.com/huggingface/transformers/blob/39cb6f58e645c90efbcc13593b0d3bf37db2e566/src/transformers/models/gpt2/modeling_gpt2.py#L188
2. GPT2 has been seen to produce very large logits (e.g. https://github.com/huggingface/transformers/pull/2303#issuecomment-587375740), which means that small differences in the padding, e.g. using -10,000 or -20,000 instead of -inf before the softmax, can actually make a significant difference.

Now taking this into account for your example, it means the following: for input_str_3, "of" attends to "<|endoftext|>" with a padding penalty of only -10,000 (padding mask), while for input_str_4, "of" attends to "<|endoftext|>" with a padding penalty of -20,000 (padding mask + causal mask). Even though -10,000 and -20,000 both essentially drive the softmax to zero, those differences can add up in GPT2 (especially since it tends to have extreme values). I think your reasoning is 100% correct, and those small differences in what values are used for padding could be the explanation - you could maybe try replacing all -10,000 with -torch.inf to see if the problem persists.
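To make the point about additive masks concrete, here is a small, self-contained sketch (not the actual GPT2 attention code) comparing an additive -10,000 mask, the doubled -20,000 mask, and a hard -inf mask. The raw scores are deliberately exaggerated so the leakage becomes visible; in typical ranges all three give essentially identical results.

import torch

# Hypothetical raw attention scores of one query over four keys; the last key
# is a padding token. The values are deliberately extreme to mimic the
# "very large logits" behaviour mentioned above.
scores = torch.tensor([5.0, 2.0, 1.0, 9990.0])

def attention_on_padding(pad_penalty: float) -> float:
    s = scores.clone()
    s[-1] = s[-1] + pad_penalty  # additive mask, in the spirit of the code linked above
    return torch.softmax(s, dim=-1)[-1].item()

print(attention_on_padding(-10_000.0))      # padding mask only   -> ~3e-07, not exactly zero
print(attention_on_padding(-20_000.0))      # padding + causal    -> 0.0
print(attention_on_padding(float("-inf")))  # hard mask           -> 0.0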
I found this issue extremely helpful for my experiment. I was wondering why pretrained decoder-only LMs were failing to generate anything with tokenizer.add_special_tokens({'pad_token': '[PAD]'}); model.resize_token_embeddings(len(tokenizer)). This issue pretty much explains why my implementation failed so badly on the generation task. Again, I really appreciate it =]