LongT5 Models Are Not Initialized With Pretrained Weights
System Info
- transformers version: 4.20.1
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.8.2 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
@LysandreJik @stancld @patrickvonplaten
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I have tried fine-tuning LongT5 on a long-range summarization task with a custom dataset (think CNN/DM in that it is highly extractive). While long-t5-tglobal-base works well (I converge to a validation loss of ~1.25 and ROUGE-2 of ~21), long-t5-local-base, long-t5-local-large, and long-t5-tglobal-large all give me training/validation losses of 200+ with ROUGE scores of exactly 0, which makes me believe these models haven't actually been initialized with Google's weights. Here are the JSON outputs from trainer.evaluate() after 1 epoch of training (a sanity-check sketch follows the results below):
google/long-t5-local-base: {'epoch': 1.0, 'eval_gen_len': 1023.0, 'eval_loss': 366.21673583984375, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_runtime': 37.9896, 'eval_samples_per_second': 0.132, 'eval_steps_per_second': 0.053}
google/long-t5-tglobal-base (this one works correctly): {'epoch': 1.0, 'eval_gen_len': 708.2, 'eval_loss': 1.6017440557479858, 'eval_rouge1': 35.7791, 'eval_rouge2': 11.5732, 'eval_rougeL': 19.1541, 'eval_rougeLsum': 31.8491, 'eval_runtime': 34.8695, 'eval_samples_per_second': 0.143, 'eval_steps_per_second': 0.057}
google/long-t5-local-large: {'epoch': 0.77, 'eval_gen_len': 1023.0, 'eval_loss': 252.44662475585938, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_runtime': 89.2506, 'eval_samples_per_second': 0.056, 'eval_steps_per_second': 0.034}
google/long-t5-tglobal-large: {'epoch': 0.77, 'eval_gen_len': 1023.0, 'eval_loss': 241.6276397705078, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_runtime': 89.9801, 'eval_samples_per_second': 0.056, 'eval_steps_per_second': 0.033}
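As a quick sanity check (my own sketch, not from the issue; the probe text and target are arbitrary), you can compare each checkpoint's out-of-the-box loss on a trivial input/target pair before any fine-tuning. A properly pretrained seq2seq model should give a modest loss, while a randomly initialized one typically gives a very large one:

```python
# Sketch: probe each LongT5 checkpoint's out-of-the-box loss.
# A randomly initialized model should give a much larger loss than
# a pretrained one on the same (arbitrary) input/target pair.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CHECKPOINTS = [
    "google/long-t5-local-base",
    "google/long-t5-tglobal-base",
    "google/long-t5-local-large",
    "google/long-t5-tglobal-large",
]

text = "summarize: The quick brown fox jumps over the lazy dog."
target = "A fox jumps over a dog."  # arbitrary probe target

for name in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    print(f"{name}: loss = {loss.item():.2f}")
```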
For reproduction, just run the standard Hugging Face PyTorch training script for summarization on any official dataset (CNN/DM, XSum, etc.).
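A minimal invocation might look like the following (a sketch based on the official run_summarization.py example; the exact flag names and values are assumptions and may vary across transformers versions):

```bash
python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/long-t5-local-base \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --do_train \
    --do_eval \
    --max_source_length 4096 \
    --max_target_length 512 \
    --per_device_train_batch_size 1 \
    --predict_with_generate \
    --output_dir /tmp/longt5-summarization
```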
Note that I haven't tried the 3B-parameter versions, so I cannot speak to whether this problem affects them as well.
Expected behavior
All four models should have a low validation loss when fine-tuning on summarization (as opposed to three of them showing validation losses of 200+ as if they were randomly initialized).
Top GitHub Comments
Update 2: Loading from Flax works for long-t5-tglobal-large and long-t5-local-base, but does not work for long-t5-local-large (whose training and validation losses start and flatline at around 10).
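For reference, a Flax checkpoint can be loaded into a PyTorch model with the standard from_flax flag of from_pretrained; a minimal sketch, assuming flax is installed (the output directory name is illustrative):

```python
# Sketch: load the Flax weights of a LongT5 checkpoint into a PyTorch model,
# bypassing the PyTorch checkpoint, then re-save them in PyTorch format.
# Requires `pip install flax`.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/long-t5-local-base", from_flax=True
)
model.save_pretrained("./long-t5-local-base-from-flax")  # illustrative path
```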