
T5Model in fp16 still yield nan with more complex examples

See original GitHub issue

🐛 Bug

Hello, and thank you for the recent PR with the fp16 fixes. The model seems to work well with short inputs, but once it is fed more complex data it still yields NaNs.

Information

Model I am using: T5

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Run the code:

from transformers import T5Model
import torch

model = T5Model.from_pretrained("t5-base").cuda().half().eval()
inputs = torch.tensor([[37,423,215,1504,13,8,1186,10670,11,10449,49,1152,11363,15465,1514,5,4433,399,7863,24766,15,17,965,594,5386,14286,28,8,6,5,755,5781,32099,993,3744,21,8,2367,18,458,53,16616,32098,16,32097,7660,16409,77,19,3,107,13164,1054,32096,993,1970,9368,948,147,8,15465,5861,87,25481,788,12,8,32095,1300,61,37,423,215,1504,13,3,24151,40,3,19668,594,5386,14286,28,8,3,115,13164]]).cuda()
decoder_input_ids = torch.tensor([[21820, 296, 55]]).cuda()

out = model(input_ids=inputs, decoder_input_ids=decoder_input_ids)
# encoder outputs
out[2][:,:2]

output:

tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward>)

Expected behavior

Output with non-NaN values.
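As a sanity check (a sketch, not part of the original report; it reuses inputs and decoder_input_ids from the repro above and the same tuple indexing as transformers 2.10), the fp32 model should produce finite values for the same example:

from transformers import T5Model
import torch

# Same call as the repro, but keeping the model in fp32 (no .half())
model_fp32 = T5Model.from_pretrained("t5-base").cuda().eval()

with torch.no_grad():
    out_fp32 = model_fp32(input_ids=inputs, decoder_input_ids=decoder_input_ids)

# Index 2 holds the encoder outputs, matching the repro above; expect tensor(True)
print(torch.isfinite(out_fp32[2]).all())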

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (7 by maintainers)

Top GitHub Comments

7 reactions
calclavia commented, Jun 8, 2020

@patrickvonplaten Even with O1, I tried fine-tuning T5-base and it converges to NaN values within fewer than 100 iterations. The stability of this model seems poor; perhaps the first few iterations of fine-tuning need to run in FP32.
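A minimal sketch of that idea (not from the original comment; it uses native torch.cuda.amp from PyTorch 1.6+ rather than apex O1, and model, optimizer, and train_loader are placeholder names): keep the first steps in full precision and only enable autocast afterwards.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
FP32_WARMUP_STEPS = 100  # arbitrary; "first few iterations" per the comment above

for step, batch in enumerate(train_loader):
    use_amp = step >= FP32_WARMUP_STEPS
    optimizer.zero_grad()
    with autocast(enabled=use_amp):
        loss = model(**batch)[0]  # assumes the batch includes labels, so index 0 is the loss
    if use_amp:
        # mixed precision: scale the loss to avoid fp16 gradient underflow
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    else:
        # plain fp32 step during warmup
        loss.backward()
        optimizer.step()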

4 reactions
leecming commented, Dec 15, 2020

Ran into this issue and found a workaround to get FP16 training working. T5DenseGatedGeluDense doesn't play nicely with FP16, specifically the final dense layer that resizes from d_ff back to d_model. I used PyTorch's autocast/GradScaler mixed-precision implementation and created an exception for that specific dense layer.

import torch.nn as nn
from torch.cuda.amp import autocast
from transformers.activations import ACT2FN


class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        # Gated-GELU feed-forward: two d_model -> d_ff projections, gated and combined
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        # Keep the final d_ff -> d_model projection out of autocast so it runs
        # in fp32; cast the activations up to match the fp32 weights.
        with autocast(enabled=False):
            hidden_states = self.wo(hidden_states.float())
        return hidden_states
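For completeness, a sketch of the surrounding training step implied by the comment above (model, optimizer, and batch are placeholders, not from the original post): the forward pass runs under autocast while the patched wo projection opts out internally, and GradScaler applies the loss scaling.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    loss = model(**batch)[0]  # assumes labels are in the batch, so index 0 is the loss
scaler.scale(loss).backward()  # scaled backward pass
scaler.step(optimizer)         # unscales gradients, skips the step if inf/NaN
scaler.update()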
Read more comments on GitHub.

