Issues with the EncoderDecoderModel for sequence to sequence tasks
❓ Questions & Help
I have been trying to build an encoder-decoder, sequence-to-sequence transformer model with various pretrained models. For the most part I have been using BERT (bert-base-cased), but I have encountered issues with several of the models I tried.
The model is intended for an English-to-English sequence-to-sequence problem.
For reference, I have been using the seq2seq example in this pull request as a template:
https://github.com/huggingface/transformers/pull/3402
but I have needed to make some modifications to it to account for other recent changes in the `EncoderDecoderModel` class.
I have hit a few main issues; three of them are posted here. I think at least some of them may be bugs in the `EncoderDecoderModel` code.
- A recent commit made some major changes to the forward method, and I’ve been hitting issues with the section that defines the decoder_outputs (around line 253 of modeling_encoder_decoder.py). The example in the pull request I linked does not provide `decoder_input_ids` when setting up the model, but that is now required by the code in the recent commit. For training, I modified my code to provide `decoder_input_ids` as the target tokens shifted one position to the right with a PAD token in front, as described in various papers (a sketch of this shifting is included after the training code below). However, I don’t understand why this is required in eval mode: shouldn’t the model have no decoder input tokens at test/eval time, and only be able to see the tokens it has actually output so far? I don’t understand what I’m supposed to provide as `decoder_input_ids` in evaluation mode, and I haven’t been able to find documentation on it.
The code I’m currently using for training looks something like this:

```python
for step, batch in enumerate(epoch_iterator):
    # Skip past any already trained steps if resuming training
    if steps_trained_in_current_epoch > 0:
        steps_trained_in_current_epoch -= 1
        continue

    model.train()
    batch = tuple(t.to(args.device) for t in batch)
    input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch

    # Add other inputs here, including kwargs
    inputs = {"input_ids": input_ids, "attention_mask": input_mask, "decoder_input_ids": decoder_ids}

    # The output tuple structure depends on the model used and the arguments invoked.
    # For BERT-type models, this is
    #     decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
    # For GPT2-type models, it at least starts with the decoder predictions.
    # See the EncoderDecoderModel class for more details.
    output = model(**inputs)
```
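The shifting mentioned above happens when the batch is built; a minimal sketch of what that might look like (assuming `output_ids` holds the target token ids and `tokenizer` is the shared tokenizer, both following the snippet above):

```python
import torch

# Hypothetical batch-construction step (not from the original issue): shift the
# target ids one position to the right and prepend a PAD token, so the decoder
# sees <PAD>, y_1, ..., y_{n-1} while being trained to predict y_1, ..., y_n.
pad_column = torch.full(
    (output_ids.size(0), 1),
    tokenizer.pad_token_id,
    dtype=torch.long,
    device=output_ids.device,
)
decoder_ids = torch.cat([pad_column, output_ids[:, :-1]], dim=-1)
```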
More context is given in the linked pull request, since again this is copied from there. The initial pull request does not provide the `decoder_input_ids` parameter, but it seems that is now required. My code in eval mode is similar, but without `decoder_input_ids`, and this code fails:
```python
for batch in tqdm(eval_dataloader, desc="Evaluating"):
    batch = tuple(t.to(args.device) for t in batch)
    input_ids, output_ids, input_mask, output_mask, _, decoder_ids = batch

    with torch.no_grad():
        inputs = {"input_ids": input_ids, "attention_mask": input_mask}

        # The output tuple structure depends on the model used and the arguments invoked.
        # For BERT-type models, this is
        #     decoder_predictions, encoded_embeddings, encoded_attention_mask = model(**inputs)
        # For GPT2-type models, it at least starts with the decoder predictions.
        # See the EncoderDecoderModel class for more details.
        output = model(**inputs)
```
This code fails in modeling_encoder_decoder.py, line 283, with `ValueError: You have to specify either input_ids or inputs_embeds`.
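One possible way around this at eval time, sketched here purely as an assumption rather than something from the original issue, is to provide `decoder_input_ids` explicitly and extend them greedily with the model’s own predictions, starting from a single PAD token:

```python
# Hedged sketch of greedy decoding at eval time (hypothetical): start from a
# single PAD token and repeatedly feed the model its own previous predictions.
# max_target_length is an assumed limit, and output index 0 is assumed to hold
# the decoder prediction logits, as in the training snippet above.
generated = torch.full(
    (input_ids.size(0), 1),
    tokenizer.pad_token_id,
    dtype=torch.long,
    device=args.device,
)
for _ in range(max_target_length):
    with torch.no_grad():
        logits = model(
            input_ids=input_ids,
            attention_mask=input_mask,
            decoder_input_ids=generated,
        )[0]
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=-1)
```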
- The pull request uses a GPT2 model as an example, but that no longer works, because the code mentioned in the first issue above requires parameters like `encoder_hidden_states` that GPT2 does not take at initialization. When I try to create a GPT2-based model I get exceptions regarding this extra parameter. In other words, when I switch from a bert-bert model to a gpt2-gpt2 model, the code posted above fails in the forward method of the `EncoderDecoderModel` (line 283 of modeling_encoder_decoder.py) because `encoder_hidden_states` is an unexpected parameter for GPT2. Is this intended? Is GPT2 no longer supported for an encoder-decoder architecture using this code?
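For reference, a minimal sketch of a gpt2-gpt2 setup of this kind (the exact instantiation call is assumed here; the failure itself only surfaces later, in forward):

```python
from transformers import EncoderDecoderModel

# Hypothetical gpt2-gpt2 setup (whether from_encoder_decoder_pretrained is the
# exact entry point used is an assumption); a later comment in this thread notes
# that GPT2 is not currently supported as an EncoderDecoderModel decoder.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("gpt2", "gpt2")
```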
- This one is more of a general question, but since I’m posting the above two as issues anyway, I figured I’d add it here in case anybody can clarify and save a separate issue being created. I believe I’m doing this part correctly, but it was not handled in the example code, so I want to verify if possible: for the decoder attention mask, during training all non-PAD tokens are expected to be unmasked, and during evaluation no mask should be provided so that a default causal mask is used, right? (A sketch of the training-time mask is included just below this list.)
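Concretely, the training-time mask described above might be built like this (a sketch, reusing `decoder_ids`, `inputs`, and `tokenizer` from the training snippet earlier):

```python
# Hypothetical training-time decoder mask: 1 for every non-PAD position of the
# shifted decoder inputs, 0 for PAD positions.
decoder_attention_mask = (decoder_ids != tokenizer.pad_token_id).long()
inputs["decoder_attention_mask"] = decoder_attention_mask
```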
@patrickvonplaten, tagging you in this issue as requested.
Thank you for your time!! Let me know if you need more code; again, my code is roughly 95% identical to the run_seq2seq.py example in the linked PR, just with some changes to account for recent modifications in modeling_encoder_decoder.py.
Top GitHub Comments
Hey, as usual I’m very late on my own timelines, but I started working on a Bert2Bert tutorial for summarization yesterday 😃. It’s still a work in progress, but it will be ready by next week.
The code should work as it is; I still have to fine-tune the hyperparameters and want to add some nicer metrics to measure the performance during training.
If you want to follow the work live 😄 here is the google colab I’m working on at the moment: https://colab.research.google.com/drive/13RXRepDN7bOJgdxPuIziwbcN8mpJLyNo?usp=sharing
@iliemihai, one thing I can directly see from your notebook is that I think you are not masking the loss of padded tokens, so the loss of every pad token id is back-propagated through the network.
Since your `decoder_input_ids` are in PyTorch, I think you can do the following for your `labels`:
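Presumably something along these lines (a sketch, assuming the usual transformers convention that label values of -100 are ignored by the loss):

```python
# Sketch (assumption, not the original snippet): clone the decoder inputs and set
# pad positions to -100, which the transformers loss conventionally ignores.
labels = decoder_input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100
```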
Hi @dbaxter240, multiple bugs were fixed in #4680. Can you please take a look at whether this error persists?
I think ideally you should not copy-paste old encoder-decoder code into another repo, since the code quickly becomes outdated and is hard for us to debug. The `EncoderDecoderModel` is still a very premature feature of this library and prone to change quickly. It would be great if you could try to use as much up-to-date code from this library as possible.
I’m very sorry for only finding this big bug now! It seems like you have invested quite a lot of energy into your code. I will soon (~2 weeks) open-source a notebook giving a nice example of how the `EncoderDecoderModel` can be leveraged to fine-tune a Bert2Bert model.
Also note that PR #3402 is rather outdated, and since we don’t provide `EncoderDecoderModel` support for GPT2 at the moment, the GPT2 setup it uses is still not possible.