T5-v1.1 loss goes to nan when fp16 training is enabled

See original GitHub issue

Environment info

I tested in two different environments: my native env and the NVIDIA container pytorch_21.09 (container values are given in parentheses below). For more details, please refer to https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-09.html#rel_21-09

  • transformers version: 4.11.3
  • Platform: Arch Linux 5.14.14-arch1-1 (Ubuntu 20.04)
  • Python version: 3.9.7 (3.8)
  • PyTorch version (GPU?): 1.9.1 (1.10a)
  • Tensorflow version (GPU?): 2.6.0 (did not use)
  • Using GPU in script?: 2080Ti (V100)
  • Using distributed or parallel set-up in script?: using fp16

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using: t5-v1.1 (small and base). With mixed precision, the loss goes to nan.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

The bug can be reproduced with run_summarization.py & run_summarization_no_trainer.py.

To reproduce

Steps to reproduce the behavior:

1. Either of the following commands reproduces the issue (both the native amp and apex fp16 backends hit the same problem):

python run_summarization.py \
    --fp16 --fp16_backend apex \
    --model_name_or_path google/t5-v1_1-base \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir

accelerate launch --fp16 run_summarization_no_trainer.py \
    --model_name_or_path google/t5-v1_1-base \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --per_device_train_batch_size=2 \
    --output_dir ~/tmp/tst-summarization
2. If you print the loss step by step, you will see that it goes to nan. (For the Trainer script, I print the loss before trainer.training_step returns; a sketch of one way to do this follows.)
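
For example, a minimal sketch of how the per-step loss could be surfaced with the Trainer path (LossDebugTrainer is a hypothetical subclass for illustration, not part of run_summarization.py):

import torch
from transformers import Seq2SeqTrainer

class LossDebugTrainer(Seq2SeqTrainer):
    # Hypothetical subclass that prints the loss returned by each training step,
    # so the first step that produces nan is easy to spot.
    def training_step(self, model, inputs):
        loss = super().training_step(model, inputs)  # detached loss tensor
        if torch.isnan(loss).any():
            print(f"nan loss at global step {self.state.global_step}")
        else:
            print(f"step {self.state.global_step}: loss = {loss.item():.4f}")
        return loss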

Possible Reason

In https://github.com/huggingface/transformers/pull/10496, models clamp inf values only when hidden_states.dtype == torch.float16. However, even when fp16 training is enabled, the hidden_states.dtype is still torch.float32. This might be due to the layer_norm operation.

Here is some more information that might be useful.

When using BART and T5 with fp16 training, hidden_states.dtype is likewise still torch.float32; however, their loss does not go to nan.
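
To make the dtype behaviour concrete, here is a minimal sketch of the autocast behaviour described above. It uses plain torch.nn.functional.layer_norm rather than T5's custom T5LayerNorm and needs a CUDA device; the clamp at the end only approximates the guard added in PR #10496.

import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 16, device="cuda")
w = torch.randn(16, 16, device="cuda")

with torch.cuda.amp.autocast():
    h = x @ w                     # matmul autocasts to float16
    print(h.dtype)                # torch.float16
    h = F.layer_norm(h, (16,))    # layer_norm autocasts to float32
    print(h.dtype)                # torch.float32

# Approximation of the guard from PR #10496: it only clamps float16 hidden states,
# so float32 hidden states like h above are never clamped.
if h.dtype == torch.float16:
    clamp_value = torch.finfo(h.dtype).max - 1000
    h = torch.clamp(h, min=-clamp_value, max=clamp_value)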

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
Liangtaiwan commented, Nov 3, 2021

@stas00 @patrickvonplaten @LysandreJik PR #10956 does prevent T5 from going to nan and achieves results comparable to fp32. Closing this issue; let's move the discussion to PR #10956.

2 reactions
ibeltagy commented, Dec 20, 2021

I am working with @HaokunLiu on a project that uses T5 and he found a great solution to this problem. The idea is to scale down the weights of the model in a specific pattern that maintains the relationship between the weights. I am not sure if this transformation is loss-preserving, but logits.argmax should remain the same.

Here’s his script:

import torch
from transformers import T5ForConditionalGeneration


emb_scaling = 1 / 32.0
att_v_scaling = 1 / 4.0
att_o_scaling = 1 / 8.0
ff_wi_scaling = 1 / 4.0
ff_wo_scaling = 1 / 4.0
ff_ln_scaling = 1 / 2.0

assert att_v_scaling * att_o_scaling == emb_scaling
assert ff_wi_scaling * ff_wo_scaling * ff_ln_scaling == emb_scaling

new_model = T5ForConditionalGeneration.from_pretrained('t5-base')
with torch.no_grad():
    new_model.shared.weight *= emb_scaling
    for unit in new_model.encoder.block:
        unit.layer[0].SelfAttention.v.weight *= att_v_scaling
        unit.layer[0].SelfAttention.o.weight *= att_o_scaling
        unit.layer[1].DenseReluDense.wi.weight *= ff_wi_scaling
        unit.layer[1].DenseReluDense.wo.weight *= ff_wo_scaling
        unit.layer[1].layer_norm.weight *= ff_ln_scaling
    for unit in new_model.decoder.block:
        unit.layer[0].SelfAttention.v.weight *= att_v_scaling
        unit.layer[0].SelfAttention.o.weight *= att_o_scaling
        unit.layer[1].EncDecAttention.v.weight *= att_v_scaling
        unit.layer[1].EncDecAttention.o.weight *= att_o_scaling
        unit.layer[2].DenseReluDense.wi.weight *= ff_wi_scaling
        unit.layer[2].DenseReluDense.wo.weight *= ff_wo_scaling
        unit.layer[2].layer_norm.weight *= ff_ln_scaling
    new_model.lm_scale_modifier /= emb_scaling

new_model.save_pretrained('t5-base-fp16-fixed')

In __init__

https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1461

you need to add:

self.lm_scale_modifier = nn.Parameter(torch.ones(config.d_model))

Then, in the forward function

https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1640

you need to add the following lines:

sequence_output = sequence_output * self.lm_scale_modifier  # new code
lm_logits = self.lm_head(sequence_output)                   # existing code
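
For reference, a hypothetical usage sketch of the rescaled checkpoint, assuming a local transformers install patched with lm_scale_modifier as above and the t5-base-fp16-fixed directory produced by the earlier script:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration  # patched class with lm_scale_modifier

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base-fp16-fixed")
model = model.half().cuda().eval()  # pure fp16 inference with the rescaled weights

text = "summarize: The quick brown fox jumped over the lazy dog near the river bank."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))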