LongT5ForConditionalGeneration NAN losses with bf16

System Info

  • transformers version: 4.23.0.dev0
  • torch version: 1.12.1
  • OS: Ubuntu 20
  • CUDA: 11.6

The problem is that LongT5 is supposed to work with bf16=True, but it doesn't: training produces NaN losses. fp16 is known to fail with this model, and I have confirmed that it does. However, LongT5 was reportedly trained in bf16, so I would expect setting bf16 to True to work. My training arguments look like this:

{
    "evaluation_strategy": "epoch",
    "num_train_epochs": 4,
    "do_train": True,
    "do_eval": False,
    "eval_steps": 2,
    "logging_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 4,
    "seed": 69,
    "bf16": True,
    "dataloader_num_workers": 32,
    "adam_epsilon": 1e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "group_by_length": False,
    "gradient_checkpointing": False,
    "lr_scheduler_type": "linear",
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 64,
    "warmup_ratio": 0.08
}

Who can help?

@patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

If you need a script for reproduction please let me know.

Expected behavior

LongT5 (as I understand from the forum and elsewhere) should work with bf16.
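
As background for why fp16 and bf16 are expected to behave differently, here is a minimal sketch using only the Python standard library. It does not reproduce the LongT5 issue itself. The helper names `to_fp16` and `to_bf16` are hypothetical, and bf16 is emulated here by truncating a float32 to its top 16 bits, whereas real hardware typically rounds to nearest. The point is only that fp16 overflows to infinity just above 65504, while bf16 keeps the float32 exponent range, and that an overflowed value can then turn into NaN.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (fp16); overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values that are too large for fp16;
        # map them to infinity, which is what fp16 arithmetic would produce.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating the float32 encoding to 16 bits.

    Assumption: plain truncation, not round-to-nearest. The input must fit
    in float32, otherwise struct.pack raises OverflowError.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    activation = 70000.0  # just above the fp16 maximum of 65504
    print(to_fp16(activation))  # inf
    print(to_bf16(activation))  # 69632.0 (coarser precision, but finite)

    # Once a value has overflowed, ordinary arithmetic can yield NaN.
    overflowed = to_fp16(activation)
    print(overflowed - overflowed)  # nan
```

Because bf16 trades mantissa bits for exponent bits, a NaN loss under bf16 usually points to something other than simple range overflow, which is consistent with the issue being resolved by a library update rather than by changing these training arguments.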

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 17 (6 by maintainers)

Top GitHub Comments

alexvaca0 commented, Nov 30, 2022

Great, with the latest version it does work! Thank you very much for helping me, @ArthurZucker!

alexvaca0 commented, Nov 16, 2022

Okay thanks! Let me know if I can help in some way… 😃 @ArthurZucker
