T5-v1.1 loss goes to nan when fp16 training is enabled

See original GitHub issue

Environment info

I tested in two different environments: my native env and the NVIDIA container pytorch_21.09 (container values are given in parentheses below). For more details, please refer to https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-09.html#rel_21-09

  • transformers version: 4.11.3
  • Platform: Arch Linux 5.14.14-arch1-1 (Ubuntu 20.04)
  • Python version: 3.9.7 (3.8)
  • PyTorch version (GPU?): 1.9.1 (1.10a)
  • Tensorflow version (GPU?): 2.6.0 (did not use)
  • Using GPU in script?: 2080Ti (V100)
  • Using distributed or parallel set-up in script?: using fp16

Who can help

@patrickvonplaten, @patil-suraj

Information

Model I am using: t5-v1.1 (small and base). With mixed precision, the loss goes to nan.

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

The bug can be reproduced with run_summarization.py & run_summarization_no_trainer.py.

To reproduce

Steps to reproduce the behavior:

1. Either of the following commands reproduces the issue (both the native amp and apex fp16 backends hit the same problem):

python run_summarization.py \
    --fp16 --fp16_backend apex \
    --model_name_or_path google/t5-v1_1-base \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir

accelerate launch --fp16 run_summarization_no_trainer.py \
    --model_name_or_path google/t5-v1_1-base \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --per_device_train_batch_size=2 \
    --output_dir ~/tmp/tst-summarization
2. If you print the loss step by step, you will see that it goes to nan. (For the Trainer script, I print the loss before trainer.training_step returns; a sketch of one way to do this follows.)
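
For example, a minimal sketch of how the per-step loss could be surfaced with the Trainer path (LossDebugTrainer is a hypothetical subclass for illustration, not part of run_summarization.py):

import torch
from transformers import Seq2SeqTrainer

class LossDebugTrainer(Seq2SeqTrainer):
    # Hypothetical subclass that prints the loss returned by each training step,
    # so the first step that produces nan is easy to spot.
    def training_step(self, model, inputs):
        loss = super().training_step(model, inputs)  # detached loss tensor
        if torch.isnan(loss).any():
            print(f"nan loss at global step {self.state.global_step}")
        else:
            print(f"step {self.state.global_step}: loss = {loss.item():.4f}")
        return loss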

Possible Reason

In https://github.com/huggingface/transformers/pull/10496, models clamp inf values only when hidden_states.dtype == torch.float16. However, even when fp16 training is enabled, the hidden_states.dtype is still torch.float32. This might be due to the layer_norm operation.

Here is some more information that might be useful.

When using BART and T5 with fp16 training, hidden_states.dtype is likewise still torch.float32; however, their loss does not go to nan.
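
To make the dtype behaviour concrete, here is a minimal sketch of the autocast behaviour described above. It uses plain torch.nn.functional.layer_norm rather than T5's custom T5LayerNorm and needs a CUDA device; the clamp at the end only approximates the guard added in PR #10496.

import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 16, device="cuda")
w = torch.randn(16, 16, device="cuda")

with torch.cuda.amp.autocast():
    h = x @ w                     # matmul autocasts to float16
    print(h.dtype)                # torch.float16
    h = F.layer_norm(h, (16,))    # layer_norm autocasts to float32
    print(h.dtype)                # torch.float32

# Approximation of the guard from PR #10496: it only clamps float16 hidden states,
# so float32 hidden states like h above are never clamped.
if h.dtype == torch.float16:
    clamp_value = torch.finfo(h.dtype).max - 1000
    h = torch.clamp(h, min=-clamp_value, max=clamp_value)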

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
Liangtaiwan commented, Nov 3, 2021

@stas00 @patrickvonplaten @LysandreJik PR #10956 does prevent T5 from going to nan and achieves results comparable to fp32. Closing this issue; let's move the discussion to PR #10956.

2 reactions
ibeltagy commented, Dec 20, 2021

I am working with @HaokunLiu on a project that uses T5 and he found a great solution to this problem. The idea is to scale down the weights of the model in a specific pattern that maintains the relationship between the weights. I am not sure if this transformation is loss-preserving, but logits.argmax should remain the same.

Here’s his script:

import torch
from transformers import T5ForConditionalGeneration


emb_scaling = 1 / 32.0
att_v_scaling = 1 / 4.0
att_o_scaling = 1 / 8.0
ff_wi_scaling = 1 / 4.0
ff_wo_scaling = 1 / 4.0
ff_ln_scaling = 1 / 2.0

assert att_v_scaling * att_o_scaling == emb_scaling
assert ff_wi_scaling * ff_wo_scaling * ff_ln_scaling == emb_scaling

new_model = T5ForConditionalGeneration.from_pretrained('t5-base')
with torch.no_grad():
    new_model.shared.weight *= emb_scaling
    for unit in new_model.encoder.block:
        unit.layer[0].SelfAttention.v.weight *= att_v_scaling
        unit.layer[0].SelfAttention.o.weight *= att_o_scaling
        unit.layer[1].DenseReluDense.wi.weight *= ff_wi_scaling
        unit.layer[1].DenseReluDense.wo.weight *= ff_wo_scaling
        unit.layer[1].layer_norm.weight *= ff_ln_scaling
    for unit in new_model.decoder.block:
        unit.layer[0].SelfAttention.v.weight *= att_v_scaling
        unit.layer[0].SelfAttention.o.weight *= att_o_scaling
        unit.layer[1].EncDecAttention.v.weight *= att_v_scaling
        unit.layer[1].EncDecAttention.o.weight *= att_o_scaling
        unit.layer[2].DenseReluDense.wi.weight *= ff_wi_scaling
        unit.layer[2].DenseReluDense.wo.weight *= ff_wo_scaling
        unit.layer[2].layer_norm.weight *= ff_ln_scaling
    new_model.lm_scale_modifier /= emb_scaling

new_model.save_pretrained('t5-base-fp16-fixed')

In __init__

https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1461

you need to add:

self.lm_scale_modifier = nn.Parameter(torch.ones(config.d_model))

Then, in the forward function

https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1640

you need to add the following lines:

sequence_output = sequence_output * self.lm_scale_modifier  # new code
lm_logits = self.lm_head(sequence_output)                   # existing code
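
For reference, a hypothetical usage sketch of the rescaled checkpoint, assuming a local transformers install patched with lm_scale_modifier as above and the t5-base-fp16-fixed directory produced by the earlier script:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration  # patched class with lm_scale_modifier

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base-fp16-fixed")
model = model.half().cuda().eval()  # pure fp16 inference with the rescaled weights

text = "summarize: The quick brown fox jumped over the lazy dog near the river bank."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))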