
Corrupted Relative Attention in T5 Decoder

See original GitHub issue

Environment info

  • platform: Mac / Ubuntu 14
  • transformers==2.11.0
  • torch==1.4.0 (GPU)
  • python 3.6

I know this is an old version, but it supports important experiments in a paper under review, so I would appreciate knowing what is wrong. I checked the commit log and I don't think any later commits resolve it.

Who can help

@patrickvonplaten (through Slack), @patil-suraj (mentioned below). Please let me know if there is anything else I can provide! Thank you!

Information

I made an artificial binary classification dataset where the input sequences are near-randomly generated tokens from the T5 vocab. The output sequences are balanced between “answer: correct” and “answer: restaurant” (two binary tag words selected at random). A data sample can be found here, in the format (input_seq \t output_seq). The custom data reader parses this data with T5Tokenizer and is_pretokenized=True (see here).
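For concreteness, here is a minimal sketch of how such a dataset could be generated. The vocabulary size, sequence length, example count, file name, and the use of raw token IDs as the input text are all illustrative assumptions, not the exact script from the repo:

```python
import random

# Illustrative sketch: near-random input sequences with balanced
# "answer: correct" / "answer: restaurant" targets, one "input_seq \t output_seq" pair per line.
VOCAB_SIZE = 32000      # approximate T5 vocab size (assumption)
SEQ_LEN = 64            # illustrative input length
NUM_EXAMPLES = 10000    # the real run used a much larger set

random.seed(0)
with open("sample_data.tsv", "w") as f:
    for i in range(NUM_EXAMPLES):
        input_seq = " ".join(str(random.randrange(3, VOCAB_SIZE)) for _ in range(SEQ_LEN))
        tag = "correct" if i % 2 == 0 else "restaurant"   # balanced binary tags
        f.write(f"{input_seq}\tanswer: {tag}\n")
```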

I feed the T5ForConditionalGeneration model (v2.11.0) with input_ids, lm_labels, and their corresponding attention_masks during training. The model should not be able to learn anything because the sequences are near-random, but in reality it converges to zero loss, meaning that the lm_logits from the decoder actually attend to future inputs (after shift_right()) and know the label. During evaluation, where I hide the binary tag, the model always predicts positive.
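For reference, the training-time forward pass on transformers==2.11.0 looks roughly like the sketch below. The toy strings stand in for a batch from the data reader; note that `lm_labels` is the v2.11.0 argument name (renamed to `labels` in later releases):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Toy stand-ins for one batch produced by the data reader
enc = tokenizer.batch_encode_plus(["1234 987 42 5566"], return_tensors="pt")
dec = tokenizer.batch_encode_plus(["answer: correct"], return_tensors="pt")

outputs = model(
    input_ids=enc["input_ids"],
    attention_mask=enc["attention_mask"],
    lm_labels=dec["input_ids"],                   # decoder_input_ids are built internally via shift_right()
    decoder_attention_mask=dec["attention_mask"],
)
loss, lm_logits = outputs[0], outputs[1]          # with lm_labels given, loss comes first, then logits
loss.backward()
```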

To reproduce

Steps to reproduce the behavior:

  1. Use the code in this repo: https://github.com/Slash0BZ/t5-investigation
  2. Run with the sample data. I have tried both pre-trained T5-Large and a randomly initialized T5-Large (written like this; see the sketch below).

I am not sure if the training data size affects the result. I ran with a training size of 5M. I am happy to provide the full data and a trained model if actual experiments are needed.
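For the randomly initialized variant, the sketch below shows one plausible way to construct it; I am assuming the repro repo does something equivalent:

```python
from transformers import T5Config, T5ForConditionalGeneration

# Pre-trained T5-Large
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Randomly initialized T5-Large: same architecture, no pre-trained weights
config = T5Config.from_pretrained("t5-large")
model = T5ForConditionalGeneration(config)
```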

Expected behavior

The training loss converges to near zero, and the lm_logits reflect predictions identical to the output sequence during training. However, during evaluation, where the data reader hides the binary tag in the output sequence (achieved by providing only “answer:” in decoder_input_ids), the prediction is uniform.

I also tried to change the decoder_input_ids. When it is [0, 1525, 10, 2024], the prediction at position 2 is 2024. When it is [0, 1525, 10, 2062], the prediction at position 2 is 2062.

Notes: 1525->“answer”, 10->“:”, 2024->“correct”, 2062->“restaurant”
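A sketch of that probe is below. The checkpoint path is a hypothetical stand-in for the model trained on the near-random data, and the encoder input is a toy stand-in for one of its input sequences:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("path/to/trained-checkpoint").eval()  # hypothetical path

enc = tokenizer.batch_encode_plus(["1234 987 42 5566"], return_tensors="pt")  # stand-in near-random input

with torch.no_grad():
    for tag in (2024, 2062):                                     # "correct" / "restaurant"
        decoder_input_ids = torch.tensor([[0, 1525, 10, tag]])   # [<pad>, "answer", ":", tag]
        outputs = model(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            decoder_input_ids=decoder_input_ids,
        )
        lm_logits = outputs[0]                                   # no lm_labels, so logits come first
        # In the buggy setup, position 2 echoes whatever token was placed at position 3.
        print(tag, lm_logits[0, 2].argmax().item())
```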

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

7 reactions
dirkgr commented, Mar 4, 2021

We use https://github.com/allenai/allennlp/blob/f091cb9cd92e767f55659b2b59f0ffb75bc613be/allennlp/nn/util.py#L239, which ultimately boils down to using this value: torch.finfo(tensor.dtype).min.
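In isolation, the suggestion boils down to something like the following sketch: derive the “very negative” additive mask value from the tensor dtype instead of hard-coding -10000. The helper names here are illustrative, not AllenNLP's exact API:

```python
import torch

def min_value_of_dtype(dtype: torch.dtype) -> float:
    # The most negative value representable in this dtype
    return torch.finfo(dtype).min

def additive_attention_mask(mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # mask: 1 for positions to keep, 0 for positions to hide
    return (1.0 - mask.to(dtype)) * min_value_of_dtype(dtype)

mask = torch.tensor([[1, 1, 0, 0]])
print(additive_attention_mask(mask, torch.float32))   # kept positions ~0, hidden positions ~-3.4e38
```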

3 reactions
patil-suraj commented, Mar 5, 2021

@patrickvonplaten I never faced this issue in my T5 experiments, but it does seem possible that -10000 can cause some issues, because while investigating the fp16 issue we saw that T5 produces large activation values.

And I agree with @dirkgr's solution.
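A toy illustration of that concern (the numbers are made up): if a raw attention score exceeds 10000, an additive mask of -10000 no longer hides a masked (future) position, while the dtype minimum still does:

```python
import torch

scores = torch.tensor([[12000.0, 5.0]])   # position 0 is a "future" token that should be hidden

# Additive mask of -10000: the masked position still dominates the softmax
print(torch.softmax(scores + torch.tensor([[-10000.0, 0.0]]), dim=-1))   # ~[[1., 0.]]

# Additive mask of torch.finfo(...).min: properly masked
neg_inf = torch.finfo(torch.float32).min
print(torch.softmax(scores + torch.tensor([[neg_inf, 0.0]]), dim=-1))    # [[0., 1.]]
```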


