GPT model `generate()` function not correctly skipping the padding tokens indicated by `attention_mask`
According to #7552, padding tokens are skipped when calculating the position ids during generate(), as long as the corresponding positions are masked out in attention_mask (a sketch of that logic follows the reproduction code below). If I understand this correctly, this would mean that the presence of padding tokens does not matter as long as they are not attended to. However, I found that this is not exactly the case. Am I missing something here?
Check the following code for reproduction:
import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
# note that input_str_1 and input_str_2 only differ in the number & position of eos tokens
input_str_1 = "# in a kilometer race , a beats b by 48 meters or 12 seconds . what time does a take to complete the race ? n0 = 48.0 n1 = 12.0\nleg = n0 / n1\n<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>"
input_str_2 = "# in a kilometer race , a beats b by 48 meters or 12 seconds . what time does a take to complete the race ? n0 = 48.0 n1 = 12.0\n<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>leg = n0 / n1\n"
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token
gradient_ckpt = True
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M", pad_token_id=tokenizer.eos_token_id, gradient_checkpointing=gradient_ckpt, use_cache=not gradient_ckpt)
def test_generate(input_str: str):
    input_ids = tokenizer.encode(input_str, add_special_tokens=False, return_tensors="pt")
    attention_mask = torch.where(input_ids == tokenizer.eos_token_id, torch.zeros_like(input_ids), torch.ones_like(input_ids)).to(model.device)
    output_ids = model.generate(input_ids, attention_mask=attention_mask, max_new_tokens=30, num_return_sequences=1)
    output_str = tokenizer.decode(output_ids[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
    print(f"##################\n{output_str}\n##################")
test_generate(input_str_1)
test_generate(input_str_2)
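For reference, the position-id handling from #7552 that the question relies on works roughly as follows: when an attention_mask is passed and no position_ids are given, generate() builds the position ids from a cumulative sum over the mask, so padding positions do not advance the position counter. A minimal sketch of that logic (it mirrors prepare_inputs_for_generation in the GPT2/GPT-Neo models; exact details may differ across transformers versions):

import torch

def position_ids_from_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # Real tokens (mask == 1) advance the position counter, padding tokens do not.
    position_ids = attention_mask.long().cumsum(-1) - 1
    # Padding positions get a dummy position id; they are masked out anyway.
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids

# Hypothetical mask with two padding tokens in the middle:
mask = torch.tensor([[1, 1, 1, 0, 0, 1, 1]])
print(position_ids_from_attention_mask(mask))  # tensor([[0, 1, 2, 1, 1, 3, 4]])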
Hey @niansong1996,

I think your understanding is very much correct here. If I understand your example correctly, you are seeing (very) small differences in the output logits that shouldn't be there. I'm quite sure that this is because masked tokens are not perfectly masked but just have a large negative number (-10,000) added to them, to avoid any issues with float16. Now this is amplified in GPT2 for two reasons:

1. GPT2 uses a causal mask with -10,000 by default, and then if the token is also masked by the padding mask it adds -10,000 again instead of replacing the score with just -10,000. E.g. see these lines: https://github.com/huggingface/transformers/blob/39cb6f58e645c90efbcc13593b0d3bf37db2e566/src/transformers/models/gpt2/modeling_gpt2.py#L188
2. GPT2 has been seen to produce very large logits (e.g. https://github.com/huggingface/transformers/pull/2303#issuecomment-587375740), which means that small differences in the padding, e.g. using -10,000 or -20,000 instead of -inf before the softmax, can actually make a significant difference.

Now taking this into account for your example, it means the following: for input_str_3, "of" attends to "<|endoftext|>" with a padding penalty of only -10,000 (padding mask), while for input_str_4, "of" attends to "<|endoftext|>" with a padding penalty of -20,000 (padding mask + causal mask). Even though -10,000 and -20,000 both essentially drive the softmax to zero, those differences can add up in GPT2 (especially since it tends to have extreme values). I think your reasoning is 100% correct, and those small differences in what values are used for padding could be the explanation - you could maybe try replacing all -10,000 with -torch.inf to see if the problem persists.
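To make the point about additive masks concrete, here is a small, self-contained sketch (not the actual GPT2 attention code) comparing an additive -10,000 mask, the doubled -20,000 mask, and a hard -inf mask. The raw scores are deliberately exaggerated so the leakage becomes visible; in typical ranges all three give essentially identical results.

import torch

# Hypothetical raw attention scores of one query over four keys; the last key
# is a padding token. The values are deliberately extreme to mimic the
# "very large logits" behaviour mentioned above.
scores = torch.tensor([5.0, 2.0, 1.0, 9990.0])

def attention_on_padding(pad_penalty: float) -> float:
    s = scores.clone()
    s[-1] = s[-1] + pad_penalty  # additive mask, in the spirit of the code linked above
    return torch.softmax(s, dim=-1)[-1].item()

print(attention_on_padding(-10_000.0))      # padding mask only   -> ~3e-07, not exactly zero
print(attention_on_padding(-20_000.0))      # padding + causal    -> 0.0
print(attention_on_padding(float("-inf")))  # hard mask           -> 0.0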
I found this issue extremely helpful for my experiment. I was wondering why pretrained decoder-only LMs were failing to generate anything with tokenizer.add_special_tokens({'pad_token': '[PAD]'}); model.resize_token_embeddings(len(tokenizer)). This issue pretty much explains why my implementation failed so badly on the generation task. Again, I really appreciate it =]