Corrupted Relative Attention in T5 Decoder
Environment info
Platform: Mac / Ubuntu 14, transformers==2.11.0, torch==1.4.0 (GPU), Python 3.6. I know this is an old version, but it supports important experiments in a paper under review, so I would appreciate knowing what is wrong. I checked the commit log and I don't think any later commits resolve it.
Who can help
@patrickvonplaten (through Slack), @patil-suraj (mentioned below). Please let me know if there is anything else I can provide! Thank you!
Information
I made an artificial binary classification dataset where the input sequences are near-randomly generated tokens from the T5 vocab. The output sequences are balanced between “answer: correct” and “answer: restaurant” (the two binary tag words were selected at random). A data sample can be found here, in the format (input_seq \t output_seq\n). The custom data reader parses this data with T5Tokenizer and is_pretokenized=True (see here).
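For reference, here is a minimal sketch of such a data reader, assuming the tab-separated format above and the transformers 2.11.0-era tokenizer API; the function name and maximum lengths are illustrative, not the exact code from the repo:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")

def read_examples(path, max_input_len=128, max_output_len=8):
    # Hypothetical helper: parse "input_seq \t output_seq" lines and tokenize
    # whitespace-split tokens with is_pretokenized=True, as the reader above does.
    examples = []
    with open(path) as f:
        for line in f:
            input_seq, output_seq = line.rstrip("\n").split("\t")
            enc = tokenizer.encode_plus(
                input_seq.split(), is_pretokenized=True,
                max_length=max_input_len, pad_to_max_length=True, return_tensors="pt",
            )
            dec = tokenizer.encode_plus(
                output_seq.split(), is_pretokenized=True,
                max_length=max_output_len, pad_to_max_length=True, return_tensors="pt",
            )
            examples.append({
                "input_ids": enc["input_ids"],
                "attention_mask": enc["attention_mask"],
                "lm_labels": dec["input_ids"],
                "decoder_attention_mask": dec["attention_mask"],
            })
    return examples
```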
I feed the T5ForConditionalGeneration model (v2.11.0) with input_ids, lm_labels, and their corresponding attention masks during training. The model should not be able to learn anything because the sequences are near-random, but in reality it converges to zero loss, meaning that the decoder's lm_logits actually attend to future inputs (even after shift_right()) and know the label. During evaluation, where I hide the binary tag, the model always predicts positive.
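For clarity, a minimal sketch of the training step described above, assuming the transformers 2.11.0 API (where the labels argument is lm_labels and the forward pass returns a tuple with the loss first); variable names are illustrative:

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(batch):
    # lm_labels are shifted right internally by the model to build decoder_input_ids
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        lm_labels=batch["lm_labels"],
        decoder_attention_mask=batch["decoder_attention_mask"],
    )
    loss = outputs[0]  # in v2.11.0 the output is a tuple; loss comes first when lm_labels is given
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```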
To reproduce
Steps to reproduce the behavior:
- Use the code in this repo: https://github.com/Slash0BZ/t5-investigation
- Run with the sample data. I have tried both pre-trained T5-Large and randomly initialized T5-Large (written like this)
I am not sure if the training data size affects the result. I ran with a training size of 5M. I am happy to provide the full data and a trained model if actual experiments are needed.
Expected behavior
During training, the loss converges to near zero and the lm_logits reflect predictions identical to the output sequence. However, during evaluation, where the data reader hides the binary tag in the output sequence (achieved by providing only “answer:” in decoder_input_ids), the prediction is uniform (the model always predicts the same tag).
I also tried changing the decoder_input_ids. When it is [0, 1525, 10, 2024], the prediction at position 2 is 2024; when it is [0, 1525, 10, 2062], the prediction at position 2 is 2062. In other words, position 2 echoes whatever token sits at position 3, which the causal mask should prevent it from seeing.
Notes: 1525->“answer”, 10->“:”, 2024->“correct”, 2062->“restaurant”
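A minimal sketch of that probe, assuming model, input_ids, and attention_mask come from one encoded example as above (names are illustrative):

```python
import torch

model.eval()
with torch.no_grad():
    for tag in (2024, 2062):  # "correct" / "restaurant"
        decoder_input_ids = torch.tensor([[0, 1525, 10, tag]])  # <pad> "answer" ":" tag
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
        )
        lm_logits = outputs[0]  # no lm_labels passed, so logits come first
        pred = lm_logits[0, 2].argmax().item()
        # With a correct causal mask, the prediction at position 2 cannot depend on
        # the token fed at position 3; here it echoes whichever tag was supplied.
        print(tag, pred)
```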
Top GitHub Comments
We use https://github.com/allenai/allennlp/blob/f091cb9cd92e767f55659b2b59f0ffb75bc613be/allennlp/nn/util.py#L239, which ultimately boils down to using this value:
torch.finfo(tensor.dtype).min
@patrickvonplaten I never faced this issue in my T5 experiments, but it does seem possible that -10000 can cause problems, because while investigating the fp16 issue we saw that T5 produces large activation values.
And I agree with @dirkgr's solution.
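For illustration, a minimal sketch of why a fixed -10000 additive mask can leak attention when real attention scores are very large in magnitude, compared to torch.finfo(dtype).min; the score values are made up for the demonstration:

```python
import torch

dtype = torch.float32
# Scores for one query over four key positions; position 3 is a future token
# that the causal mask is supposed to hide.
scores = torch.tensor([[-15000.0, -14000.0, -13000.0, 0.0]], dtype=dtype)
causal_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])  # 1 = visible, 0 = masked

for mask_value in (-10000.0, torch.finfo(dtype).min):
    additive = (1.0 - causal_mask) * mask_value
    probs = (scores + additive).softmax(dim=-1)
    # With -10000 the "masked" position still wins the softmax because the visible
    # scores are even more negative; with finfo(dtype).min it gets ~0 weight.
    print(mask_value, probs[0].tolist())
```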