LongT5ForConditionalGeneration NAN losses with bf16

System Info

transformers version: 4.23.0.dev0
torch version: 1.12.1
OS: Ubuntu 20
CUDA: 11.6

LongT5 is supposed to work with bf16=True, but it doesn't: training produces NaN losses. fp16 is known to fail with this model, and I have tried it and confirmed that it does fail. However, LongT5 is supposed to have been trained in bf16, so I would expect setting bf16 to True to work. My training arguments look like this:

{
    "evaluation_strategy": "epoch",
    "num_train_epochs": 4,
    "do_train": True,
    "do_eval": False,
    "eval_steps": 2,
    "logging_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 4,
    "seed": 69,
    "bf16": True,
    "dataloader_num_workers": 32,
    "adam_epsilon": 1e-8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "group_by_length": False,
    "gradient_checkpointing": False,
    "lr_scheduler_type": "linear",
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 64,
    "warmup_ratio": 0.08
}
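For readers skimming the arguments above, the batch-related settings combine into an effective batch size per optimizer step. The following is only a minimal sketch of that arithmetic: the helper name `effective_batch_size` is hypothetical, and the device count is an assumption, since the issue does not say how many GPUs were used.

```python
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step.

    num_devices defaults to 1, which is an assumption: the issue does not
    state how many GPUs were used.
    """
    return per_device_batch * accumulation_steps * num_devices


# Values copied from the training arguments in the issue.
args = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 64,
}

batch = effective_batch_size(
    args["per_device_train_batch_size"],
    args["gradient_accumulation_steps"],
)
print(batch)  # 64 on a single device
```

With more than one device, the result scales linearly with `num_devices`.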

Who can help?

@patrickvonplaten

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

If you need a script for reproduction please let me know.

Expected behavior

LongT5 (as I understand from the forum, etc.) should work with bf16.
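The expectation that bf16 behaves better than fp16 comes from the formats' ranges: fp16 has a maximum finite value of 65504, while bf16 keeps the 8-bit exponent of float32 and gives up mantissa precision instead. The sketch below illustrates this with the standard library only. It is not how PyTorch implements bf16: the helper names `to_bf16` and `fits_fp16` are hypothetical, and `to_bf16` emulates the format by truncating the float32 encoding rather than rounding to nearest.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the float32 encoding.

    This truncates instead of rounding to nearest, so it is an approximation
    of real bf16 conversion, used here only to show the representable range.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


value = 70000.0  # an arbitrary magnitude just above the fp16 maximum

print(fits_fp16(value))  # False: the value overflows fp16
print(to_bf16(value))    # finite, but only accurate to about 3 significant digits
```

A value that overflows in fp16 can therefore stay finite in bf16, which is why NaN losses with bf16=True are unexpected here.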

Issue Analytics

Created a year ago
Comments: 17 (6 by maintainers)

Top GitHub Comments

1 reaction

alexvaca0 commented, Nov 30, 2022

Great, with the latest version it does work! Thank you very much for helping me, @ArthurZucker!

1 reaction

alexvaca0 commented, Nov 16, 2022

Okay, thanks! Let me know if I can help in some way… 😃 @ArthurZucker
