Inconsistent padding behavior for decoder_input_ids for Seq2Seq models
System Info
transformers: 4.18.0, torch: 1.12.0, Python: 3.7.13
Who can help?
@patrickvonplaten @patil-suraj
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

models = [
    "t5-small",
    "google/mt5-small",
    "facebook/m2m100_418M",
    "facebook/wmt19-ru-en",
    "facebook/bart-base",
    "facebook/blenderbot-400M-distill",
    "google/bigbird-pegasus-large-arxiv",
    "allenai/led-base-16384",
    "microsoft/prophetnet-large-uncased"
]

for model_name in models:
    # load the seq2seq model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"

    # sample sentence
    sample_sentence = "generate some numbers"
    encodings = tokenizer(sample_sentence,
                          padding="max_length",
                          max_length=5,
                          return_tensors="pt",
                          return_attention_mask=True,
                          truncation=True)

    # decoder input ids (with a default start token for the model)
    decoder_input_ids = torch.ones(1, 1, dtype=torch.int32) * model.config.decoder_start_token_id

    # model's forward without any padding for decoder_input_ids (hence without decoder_attn mask)
    outputs = model.forward(input_ids=encodings.input_ids,
                            attention_mask=encodings.attention_mask,
                            decoder_input_ids=decoder_input_ids,
                            return_dict=True)
    next_token_logits = outputs["logits"][:, -1, :]

    # same decoder input ids but padded + decoder attention mask
    decoder_input_ids_with_padding = torch.ones(1, 3, dtype=torch.int32) * tokenizer.pad_token_id
    decoder_input_ids_with_padding[:, -1] = model.config.decoder_start_token_id
    decoder_attn_mask = torch.zeros(1, 3)
    decoder_attn_mask[:, -1] = 1

    # model's forward with padding for decoder_input_ids (hence with decoder_attn mask)
    outputs_with_padding = model.forward(input_ids=encodings.input_ids,
                                         attention_mask=encodings.attention_mask,
                                         decoder_input_ids=decoder_input_ids_with_padding,
                                         decoder_attention_mask=decoder_attn_mask,
                                         return_dict=True)
    next_token_logits_with_padding = outputs_with_padding["logits"][:, -1, :]

    # check if padding affects the logits
    if torch.allclose(next_token_logits, next_token_logits_with_padding, atol=1e-3):
        print(f"No issues with model: {model_name}")
    else:
        print(f"Issues with model: {model_name}")
Expected behavior
This issue concerns seq2seq models for conditional text generation.
The output logits differ when decoder_input_ids is padded, even though a matching decoder_attention_mask is passed. The difference appears only for some models (e.g. BART, Blenderbot, Pegasus); for others (e.g. T5, MT5) the outputs are the same. The behavior is therefore inconsistent across different seq2seq models.
To reproduce these differences, run the provided script, which does the following:
- Do one forward pass for a sample prompt (input_ids, attention_mask), additionally passing the default start token for the decoder.
- Do another forward pass for the same prompt (same input_ids and attention_mask), but this time with decoder_input_ids left-padded to a sequence length of 3, with the same default start token as the last token. Additionally, decoder_attention_mask is passed to avoid attending to the padded tokens.
- Compare the last-token logits from these two forward passes for equivalence (with a tolerance of 1e-3).
This is repeated for several seq2seq models to see which of them show these differences.
Ideally, we would expect padding not to cause any such differences.
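The padding and comparison steps described above can be sketched without loading a model. This is only an illustration, not part of the original script: it uses plain Python lists instead of tensors, the helper names left_pad and logits_match are hypothetical, the token ids are made up, and logits_match applies an absolute tolerance only (torch.allclose also applies a relative tolerance).

```python
import math


def left_pad(ids, length, pad_id):
    """Left-pad a list of token ids to `length`.

    Returns (padded_ids, attention_mask), where the mask is 0 over
    padding positions and 1 over real tokens.
    """
    if len(ids) > length:
        raise ValueError("sequence is longer than the target length")
    n_pad = length - len(ids)
    padded = [pad_id] * n_pad + list(ids)
    mask = [0] * n_pad + [1] * len(ids)
    return padded, mask


def logits_match(a, b, atol=1e-3):
    """Compare two logit vectors element-wise within an absolute tolerance."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b))


# Hypothetical ids, chosen for illustration only; real values come from
# tokenizer.pad_token_id and model.config.decoder_start_token_id.
PAD_ID, START_ID = 1, 2

# Decoder input of length 3 whose last token is the start token.
padded_ids, mask = left_pad([START_ID], 3, PAD_ID)
print(padded_ids, mask)
```

If padding were handled consistently, the last-position logits produced from [START_ID] and from padded_ids with mask would pass logits_match for every model in the list.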
Top GitHub Comments
Hey! 🙌 it’s on my to do list, but can’t look at it right now so feel free to do so 😀🤗
cc @ArthurZucker