Corrupted Relative Attention in T5 Decoder
Environment info
Platform: Mac / Ubuntu 14, transformers==2.11.0, torch==1.4.0 (GPU), Python 3.6. I know this is an old version, but it supports important experiments in a paper under review, so I would appreciate knowing what is wrong. I checked the commit log and I don't think any later commits resolve it.
Who can help
@patrickvonplaten (through Slack), @patil-suraj (mentioned below). Please let me know if there is anything else I can provide! Thank you!
Information
I made an artificial binary classification dataset where the input sequences are near-randomly generated tokens from the T5 vocab. The output sequences are balanced between “answer: correct” and “answer: restaurant” (the two binary tag words were selected at random). A data sample can be found here, in the format (input_seq \t output_seq\n). The custom data reader parses this data with T5Tokenizer and is_pretokenized=True (see here).
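For reference, here is a minimal sketch of such a data reader, assuming the tab-separated format above and the transformers 2.11.0-era tokenizer API; the function name and maximum lengths are illustrative, not the exact code from the repo:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")

def read_examples(path, max_input_len=128, max_output_len=8):
    # Hypothetical helper: parse "input_seq \t output_seq" lines and tokenize
    # whitespace-split tokens with is_pretokenized=True, as the reader above does.
    examples = []
    with open(path) as f:
        for line in f:
            input_seq, output_seq = line.rstrip("\n").split("\t")
            enc = tokenizer.encode_plus(
                input_seq.split(), is_pretokenized=True,
                max_length=max_input_len, pad_to_max_length=True, return_tensors="pt",
            )
            dec = tokenizer.encode_plus(
                output_seq.split(), is_pretokenized=True,
                max_length=max_output_len, pad_to_max_length=True, return_tensors="pt",
            )
            examples.append({
                "input_ids": enc["input_ids"],
                "attention_mask": enc["attention_mask"],
                "lm_labels": dec["input_ids"],
                "decoder_attention_mask": dec["attention_mask"],
            })
    return examples
```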
I feed the T5ForConditionalGeneration model (v2.11.0) with input_ids, lm_labels, and their corresponding attention masks during training. The model should not be able to learn anything because the sequences are near-random, but in reality it converges to zero loss, meaning that the decoder's lm_logits actually attend to future inputs (even after shift_right()) and know the label. During evaluation, where I hide the binary tag, the model always predicts positive.
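For clarity, a minimal sketch of the training step described above, assuming the transformers 2.11.0 API (where the labels argument is lm_labels and the forward pass returns a tuple with the loss first); variable names are illustrative:

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(batch):
    # lm_labels are shifted right internally by the model to build decoder_input_ids
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        lm_labels=batch["lm_labels"],
        decoder_attention_mask=batch["decoder_attention_mask"],
    )
    loss = outputs[0]  # in v2.11.0 the output is a tuple; loss comes first when lm_labels is given
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```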
To reproduce
Steps to reproduce the behavior:
- Use the code in this repo: https://github.com/Slash0BZ/t5-investigation
- Run with the sample data. I have tried both pre-trained T5-Large and randomly initialized T5-Large (written like this)
I am not sure if the training data size affects the result. I ran with a training size of 5M. I am happy to provide the full data and a trained model if actual experiments are needed.
Expected behavior
During training, the loss converges to near zero and the lm_logits reflect predictions identical to the output sequence. However, during evaluation, where the data reader hides the binary tag in the output sequence (achieved by providing only “answer:” in decoder_input_ids), the prediction is uniform (the model always predicts the same tag).
I also tried changing the decoder_input_ids. When it is [0, 1525, 10, 2024], the prediction at position 2 is 2024; when it is [0, 1525, 10, 2062], the prediction at position 2 is 2062. In other words, position 2 echoes whatever token sits at position 3, which the causal mask should prevent it from seeing.
Notes: 1525->“answer”, 10->“:”, 2024->“correct”, 2062->“restaurant”
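A minimal sketch of that probe, assuming model, input_ids, and attention_mask come from one encoded example as above (names are illustrative):

```python
import torch

model.eval()
with torch.no_grad():
    for tag in (2024, 2062):  # "correct" / "restaurant"
        decoder_input_ids = torch.tensor([[0, 1525, 10, tag]])  # <pad> "answer" ":" tag
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
        )
        lm_logits = outputs[0]  # no lm_labels passed, so logits come first
        pred = lm_logits[0, 2].argmax().item()
        # With a correct causal mask, the prediction at position 2 cannot depend on
        # the token fed at position 3; here it echoes whichever tag was supplied.
        print(tag, pred)
```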
Top GitHub Comments
We use https://github.com/allenai/allennlp/blob/f091cb9cd92e767f55659b2b59f0ffb75bc613be/allennlp/nn/util.py#L239, which ultimately boils down to using this value:
torch.finfo(tensor.dtype).min
@patrickvonplaten I never faced this issue in my T5 experiments, but it does seem possible that -10000 can cause problems, because while investigating the fp16 issue we saw that T5 produces large activation values.
And I agree with @dirkgr's solution.
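For illustration, a minimal sketch of why a fixed -10000 additive mask can leak attention when real attention scores are very large in magnitude, compared to torch.finfo(dtype).min; the score values are made up for the demonstration:

```python
import torch

dtype = torch.float32
# Scores for one query over four key positions; position 3 is a future token
# that the causal mask is supposed to hide.
scores = torch.tensor([[-15000.0, -14000.0, -13000.0, 0.0]], dtype=dtype)
causal_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])  # 1 = visible, 0 = masked

for mask_value in (-10000.0, torch.finfo(dtype).min):
    additive = (1.0 - causal_mask) * mask_value
    probs = (scores + additive).softmax(dim=-1)
    # With -10000 the "masked" position still wins the softmax because the visible
    # scores are even more negative; with finfo(dtype).min it gets ~0 weight.
    print(mask_value, probs[0].tolist())
```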