LongT5ForConditionalGeneration NaN losses with bf16
System Info
- transformers version: 4.23.0.dev0
- torch version: 1.12.1
- OS: Ubuntu 20
- CUDA: 11.6
LongT5 is supposed to work with bf16=True, but it does not: the loss becomes NaN. fp16 is known to fail with this model, and I have confirmed that it does. However, LongT5 is supposed to have been trained in bf16, so I would expect setting bf16 to True to work. My training arguments look like this:
{
"evaluation_strategy": "epoch",
"num_train_epochs": 4,
"do_train": True,
"do_eval": False,
"eval_steps": 2,
"logging_strategy":"epoch",
"save_strategy": "epoch",
"save_total_limit": 4,
"seed": 69,
"bf16": True,
"dataloader_num_workers": 32,
"adam_epsilon": 1e-8,
"adam_beta1": 0.9,
"adam_beta2": 0.999,
"group_by_length": False,
"gradient_checkpointing": False,
"lr_scheduler_type": "linear",
"learning_rate": 1e-4,
"per_device_train_batch_size": 1,
"per_device_eval_batch_size": 1,
"gradient_accumulation_steps": 64,
"warmup_ratio": 0.08
}
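For context on why fp16 is expected to fail while bf16 is expected to work, here is a minimal standard-library sketch of the range difference between the two formats. It does not reproduce the LongT5 failure and makes no claim about the actual cause of the NaNs reported here; it only illustrates the commonly cited overflow argument. The helper names `to_fp16` and `to_bf16` are hypothetical and are not part of transformers or torch.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; values beyond the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values that overflow fp16 (largest finite value: 65504)
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round x to bfloat16 by keeping the top 16 bits of its float32 encoding."""
    if math.isnan(x) or math.isinf(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


activation = 1e5  # illustrative magnitude, chosen to exceed the fp16 range

print("fp16:", to_fp16(activation))  # overflows to inf
print("bf16:", to_bf16(activation))  # stays finite, with reduced precision

# Once an inf appears, ordinary arithmetic can turn it into NaN
print("inf - inf:", to_fp16(activation) - to_fp16(activation))
```

bf16 keeps the 8-bit exponent of float32 and gives up mantissa bits instead, so it trades precision for range. That is why a model trained in bf16 would be expected to avoid this particular overflow path.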
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
If you need a script for reproduction please let me know.
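Until a full script is shared, one piece of information that could help narrow things down is the first step at which the loss stops being finite. The sketch below assumes the per-step losses have already been collected as plain Python floats; `first_nonfinite_step` is a hypothetical helper, not a transformers API.

```python
import math
from typing import Iterable, Optional


def first_nonfinite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or inf loss, or None if every loss is finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up values, for illustration only
example_losses = [2.31, 1.97, float("nan"), 1.52]
print(first_nonfinite_step(example_losses))
```

Knowing whether the NaN appears at the very first step or only after some updates helps separate a problem in the forward pass from one that develops during optimization.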
Expected behavior
LongT5 (as I understand from the forum, etc.) should work with bf16.
Issue Analytics
- State:
- Created a year ago
- Comments:17 (6 by maintainers)
Top GitHub Comments
Great, with the last version it does work!! Thank you very much for helping me ! @ArthurZucker
Okay thanks! Let me know if I can help in some way… 😃 @ArthurZucker