
Issues with the EncoderDecoderModel for sequence to sequence tasks


❓ Questions & Help

I have been trying to build an encoder-decoder, sequence-to-sequence transformer model and have experimented with several pretrained models. For the most part I have been using BERT (bert-base-cased), but I have encountered issues with various models.

The model is intended for an English to English sequence to sequence problem.

For reference, I had been using the seq2seq example in this pull request as a template:

https://github.com/huggingface/transformers/pull/3402

but I have needed to make some modifications to it to account for other recent changes in the EncoderDecoderModel class.

I have hit a few main issues; three are posted here. I think at least some of them may be bugs in the EncoderDecoderModel code.

  1. A recent commit made some major changes to the forward method, and I've been hitting issues with the section that defines the decoder_outputs (around line 253 of modeling_encoder_decoder.py). The example in the pull request I linked does not provide decoder_input_ids when setting up the model, but that is now required by the code in your recent commit. When training, I modified the code to provide decoder_input_ids as the target tokens shifted one position to the right with a PAD token in front, as described in various papers. However, I don't understand why this is required in eval mode: shouldn't the model have no decoder input tokens at test/eval time, and only be able to see the tokens it has actually output so far? I don't understand what I'm supposed to provide as decoder_input_ids in evaluation mode, and I haven't been able to find documentation on it.
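For concreteness, the shift-right preparation I describe above looks roughly like this (a minimal sketch with hypothetical tensor names, assuming the tokenizer's PAD token is also used as the decoder start token):

    import torch

    def shift_right(target_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
        # Targets shifted one position to the right, with a PAD token placed in front
        decoder_input_ids = target_ids.new_full(target_ids.shape, pad_token_id)
        decoder_input_ids[:, 1:] = target_ids[:, :-1]
        return decoder_input_ids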

The code I'm currently using for training looks something like this:

        for step, batch in enumerate(epoch_iterator):


            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            model.train()
            batch = tuple(t.to(args.device) for t in batch)
            input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch

            # add other inputs here, including kwargs
            inputs = {"input_ids": input_ids, "attention_mask": input_mask, "decoder_input_ids": decoder_ids}

            # The output tuple structure depends on the model used and the arguments invoked
            # For BERT-type models, this is
            # decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
            # For GPT2-type models, this at least starts with the decoder predictions
            # See the EncoderDecoderModel class for more details
            output = model(**inputs)
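For context, the loss step after this looks roughly like the following in my setup (a sketch under my own assumptions: output[0] is taken to be the decoder prediction logits, per the comment above, and PAD positions are excluded from the loss; pad_token_id would come from the tokenizer):

    logits = output[0]  # decoder prediction logits for a BERT2BERT setup (assumption)
    loss_fct = torch.nn.CrossEntropyLoss(ignore_index=pad_token_id)  # don't learn PAD
    loss = loss_fct(logits.reshape(-1, logits.size(-1)), output_ids.reshape(-1))
    loss.backward()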

More context is given in the linked pull request, since this code is copied from there. The original pull request does not provide the decoder_input_ids parameter, but it seems that is now required. My code is similar in eval mode, but without decoder_input_ids, and this code fails:

        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = tuple(t.to(args.device) for t in batch)
            input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch
            with torch.no_grad():
                inputs = {"input_ids": input_ids, "attention_mask": input_mask}

                # The output tuple structure depends on the model used and the arguments invoked
                # For BERT-type models, this is
                # decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
                # For GPT2-type models, this at least starts with the decoder predictions
                # See the EncoderDecoderModel class for more details
                output = model(**inputs)

This code fails in modeling_encoder_decoder.py, line 283, with:

ValueError: You have to specify either input_ids or inputs_embeds
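For reference, the closest I can get to what I think eval mode should look like is a greedy loop that starts the decoder from a single start token and feeds back its own predictions. A rough sketch is below, reusing the variables from the eval loop above; the start token id and the position of the logits in the output tuple are my own assumptions, and I am not sure this is the intended usage:

    # model, input_ids and input_mask are the same variables as in the eval loop above
    start_token_id = 0   # assumption: PAD (or BOS) id used as the decoder start token
    max_length = 32
    decoder_ids = torch.full((input_ids.size(0), 1), start_token_id,
                             dtype=torch.long, device=input_ids.device)
    for _ in range(max_length):
        with torch.no_grad():
            step_output = model(input_ids=input_ids,
                                attention_mask=input_mask,
                                decoder_input_ids=decoder_ids)
        # assume step_output[0] holds the decoder prediction logits
        next_token = step_output[0][:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)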

  2. The pull request uses a GPT2 model as an example, but that no longer works, because the code mentioned in point 1 above passes parameters such as encoder_hidden_states that GPT2 does not accept. When I try to create a GPT2-based model, I get exceptions about this extra parameter. In other words, when I switch from a bert-bert model to a gpt2-gpt2 model, the code posted above fails in the forward method of the EncoderDecoderModel (line 283 of modeling_encoder_decoder) because encoder_hidden_states is an unexpected parameter for GPT2. Is this intended? Is GPT2 no longer supported for an encoder-decoder architecture using this code?

  3. This one is more of a general question, but since I'm posting the above two as issues anyway, I figured I'd add it here in case anybody can clarify, and save a separate issue from being created.

I believe I'm doing this part correctly, but it was not handled in the example code, so I want to verify if possible: for the decoder attention mask, during training all non-PAD tokens are expected to be unmasked, and during evaluation no mask should be provided and a default causal mask will be used, right? (A sketch of what I mean is below.)
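To make the question concrete, this is what I am doing for the decoder attention mask during training (a minimal self-contained sketch with dummy ids; my assumption is that the causal/look-ahead mask is applied internally by the decoder):

    import torch

    pad_token_id = 0  # hypothetical PAD id
    # dummy decoder input ids, padded at the end with PAD
    decoder_input_ids = torch.tensor([[101, 7592, 2088, 102, 0, 0]])
    # unmask every non-PAD position; no causal mask is passed explicitly
    decoder_attention_mask = (decoder_input_ids != pad_token_id).long()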

@patrickvonplaten, tagging you in this issue as requested.

Thank you for your time! Let me know if you need more code; again, my code is roughly 95% identical to the run_seq2seq.py example in the linked PR, with some changes to account for recent modifications in modeling_encoder_decoder.py.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

1 reaction
patrickvonplaten commented, Jul 10, 2020

Hey, as usual I'm very late on my own timelines, but I started working on a Bert2Bert tutorial for summarization yesterday 😃. It's still a work in progress, but it will be ready by next week.

The code should work as it is; I still have to fine-tune the hyperparameters and want to add some nicer metrics to measure performance during training.

If you want to follow the work live 😄 here is the google colab I’m working on at the moment: https://colab.research.google.com/drive/13RXRepDN7bOJgdxPuIziwbcN8mpJLyNo?usp=sharing

@iliemihai, one thing I can see directly from your notebook is that I think you are not masking the loss on padded tokens, so the loss for all PAD token positions is backpropagated through the network.

Since your decoder_input_ids are PyTorch tensors, I think you can do the following for your labels:

labels = decoder_input_ids.clone()
# mask loss for padding
labels[labels == tokenizer.pad_token_id] = -100
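
For context, -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so positions set to -100 simply do not contribute to the loss. A tiny self-contained illustration with dummy tensors:

    import torch

    logits = torch.randn(2, 5, 100)         # (batch, seq_len, vocab) dummy logits
    labels = torch.randint(0, 100, (2, 5))  # dummy gold token ids
    labels[:, -2:] = -100                   # pretend the last two positions are padding

    loss_fct = torch.nn.CrossEntropyLoss()  # ignore_index defaults to -100
    loss = loss_fct(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))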
1 reaction
patrickvonplaten commented, May 29, 2020

Hi @dbaxter240, multiple bugs were fixed in #4680. Can you please take a look at whether this error persists?

I think ideally you should not copy-paste old encoder-decoder code into another repo, since the code quickly becomes outdated and is hard for us to debug. The EncoderDecoderModel is still a very new feature of this library and prone to change quickly. It would be great if you could try to use as much up-to-date code from this library as possible.

I'm very sorry for only finding this big bug now! It seems like you have invested quite a lot of energy into your code. I will soon (~2 weeks) open-source a notebook giving a nice example of how the EncoderDecoderModel can be leveraged to fine-tune a Bert2Bert model.

Also note that PR #3402 is rather outdated and, since we don't provide EncoderDecoderModel support for GPT2 at the moment, its approach is still not possible.
