Inconsistent padding behavior for decoder_input_ids for Seq2Seq models

System Info

transformers: 4.18.0
torch: 1.12.0
Python: 3.7.13

Who can help?

@patrickvonplaten @patil-suraj

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

models = [
    "t5-small",
    "google/mt5-small",
    "facebook/m2m100_418M",
    "facebook/wmt19-ru-en",
    "facebook/bart-base",
    "facebook/blenderbot-400M-distill",
    "google/bigbird-pegasus-large-arxiv",
    "allenai/led-base-16384",
    "microsoft/prophetnet-large-uncased"
]

for model_name in models: 

    # load the seq2seq model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"

    # sample sentence
    sample_sentence = "generate some numbers"
    encodings = tokenizer(sample_sentence, 
                        padding="max_length",
                        max_length=5,
                        return_tensors="pt",
                        return_attention_mask=True,
                        truncation=True)

    # decoder input ids: a single default start token for the model
    decoder_input_ids = torch.ones(1, 1, dtype=torch.long) * model.config.decoder_start_token_id

    # forward pass without any padding for decoder_input_ids (hence no decoder attention mask)
    with torch.no_grad():
        outputs = model(input_ids=encodings.input_ids,
                        attention_mask=encodings.attention_mask,
                        decoder_input_ids=decoder_input_ids,
                        return_dict=True)
    next_token_logits = outputs["logits"][:, -1, :]


    # same decoder start token, left-padded to length 3, plus a decoder attention mask
    decoder_input_ids_with_padding = torch.ones(1, 3, dtype=torch.long) * tokenizer.pad_token_id
    decoder_input_ids_with_padding[:, -1] = model.config.decoder_start_token_id
    decoder_attn_mask = torch.zeros(1, 3)
    decoder_attn_mask[:, -1] = 1

    # forward pass with padded decoder_input_ids and the matching decoder attention mask
    with torch.no_grad():
        outputs_with_padding = model(input_ids=encodings.input_ids,
                                     attention_mask=encodings.attention_mask,
                                     decoder_input_ids=decoder_input_ids_with_padding,
                                     decoder_attention_mask=decoder_attn_mask,
                                     return_dict=True)
    next_token_logits_with_padding = outputs_with_padding["logits"][:, -1, :]
    

    # check if padding affects the logits
    if torch.allclose(next_token_logits, next_token_logits_with_padding, atol=1e-3):
        print(f"No issues with model: {model_name}")
    else:
        print(f"Issues with model: {model_name}")

Expected behavior

This issue is regarding seq2seq models for conditional text generation.

The output logits differ when decoder_input_ids is padded, even though decoder_attention_mask is passed as well. This happens only for some models (e.g. BART, Blenderbot, Pegasus); for others (e.g. T5, mT5) the outputs are identical. Behavior is therefore inconsistent across seq2seq models.

To reproduce these differences, run the provided script which does the following:

1. Do one forward pass for a sample prompt (input_ids, attention_mask), additionally passing the model's default start token as the only decoder input.
2. Do another forward pass for the same prompt (same input_ids and attention_mask), but this time with decoder_input_ids left-padded to a sequence length of 3, keeping the default start token as the last token. decoder_attention_mask is also passed so that the padded tokens are not attended to.
3. Compare the last-token logits from the two forward passes for equivalence (with an absolute tolerance of 1e-3).

This is repeated for several seq2seq models to see which of them show these differences.
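The padded decoder input and the comparison described in the steps above can be sketched without loading any model. This is a minimal, dependency-free illustration and not part of the original report: `left_pad_decoder_inputs` and `max_abs_diff` are hypothetical helper names, and the token ids (pad id 1, start id 2) and logit values are made-up placeholders rather than values read from any model config.

```python
from typing import List, Sequence, Tuple


def left_pad_decoder_inputs(
    token_ids: Sequence[int], target_len: int, pad_token_id: int
) -> Tuple[List[int], List[int]]:
    """Left-pad token_ids to target_len and build the matching attention mask."""
    if target_len < len(token_ids):
        raise ValueError("target_len must be at least len(token_ids)")
    num_pad = target_len - len(token_ids)
    padded_ids = [pad_token_id] * num_pad + list(token_ids)
    # 0 marks a padded slot, 1 marks a real token
    attention_mask = [0] * num_pad + [1] * len(token_ids)
    return padded_ids, attention_mask


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Placeholder ids for illustration only (assumed, not taken from a real config).
PAD_ID = 1
START_ID = 2

padded_ids, attention_mask = left_pad_decoder_inputs([START_ID], 3, PAD_ID)

# Placeholder "last-token logits" from two hypothetical forward passes.
logits_unpadded = [0.10, -1.25, 3.40]
logits_padded = [0.10, -1.25, 3.45]
difference = max_abs_diff(logits_unpadded, logits_padded)
exceeds_tolerance = difference > 1e-3
```

Reporting the maximum absolute difference alongside `torch.allclose` can show how far the padded logits drift, not only whether they cross the 1e-3 tolerance.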

Ideally, we would expect padding not to cause any such differences.
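One possible contributor, stated here as an assumption that is not confirmed anywhere in this thread, is how a decoder assigns positions to its inputs. If positions are taken from the slot index, left padding moves the start token from position 0 to position 2, so it receives a different position embedding even though the padded slots are masked. If positions are counted over real tokens only, or the model uses relative positions, the start token keeps the same position. The sketch below is plain Python with hypothetical function names and does not reproduce any library's actual implementation.

```python
from typing import List


def positions_by_index(attention_mask: List[int]) -> List[int]:
    """Absolute positions taken from the slot index; the mask is ignored."""
    return list(range(len(attention_mask)))


def positions_by_mask(attention_mask: List[int]) -> List[int]:
    """Positions counted over real tokens only; padded slots get 0."""
    positions = []
    seen = 0  # number of real tokens encountered so far
    for m in attention_mask:
        positions.append(seen if m else 0)
        seen += m
    return positions


unpadded_mask = [1]      # only the decoder start token
padded_mask = [0, 0, 1]  # two left pads, then the start token

# Position assigned to the start token (the last slot) under each scheme.
index_unpadded = positions_by_index(unpadded_mask)[-1]
index_padded = positions_by_index(padded_mask)[-1]
mask_unpadded = positions_by_mask(unpadded_mask)[-1]
mask_padded = positions_by_mask(padded_mask)[-1]

print("index-based:", index_unpadded, "vs", index_padded)
print("mask-based: ", mask_unpadded, "vs", mask_padded)
```

Under the index-based scheme the start token's position depends on how much padding precedes it, which would be enough to change the logits. Under the mask-based scheme it does not.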

Issue Analytics

Created: a year ago
Comments: 12 (10 by maintainers)

Top GitHub Comments

ArthurZucker commented, Oct 17, 2022

Hey! 🙌 it’s on my to do list, but can’t look at it right now so feel free to do so 😀🤗

sgugger commented, Oct 14, 2022

cc @ArthurZucker
