
T5 fp16 forward yields nan

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): T5

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I use pytorch-lightning to manage fp16. This is a minimal example that reproduces the problem.

from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# Load the model on the GPU and cast all weights to fp16
model = T5Model.from_pretrained("t5-base").cuda().half()
text = "hello world!"
inputs = tokenizer.encode(text, return_tensors="pt").cuda()
# T5 is an encoder-decoder model, so decoder_input_ids are required as well
out = model(input_ids=inputs, decoder_input_ids=inputs)
# Inspect the first 10 hidden dimensions of the decoder output
print(out[0][:, :, :10])

output:

tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward>)
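
To narrow down where the non-finite values first appear, one option (a debugging sketch, not part of the original report) is to register a forward hook on every submodule and print the name of any module whose fp16 output overflows:

import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda().half()

def report_non_finite(name):
    # Forward hook: print the module name if its output contains NaN or inf
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        if torch.is_tensor(out) and not torch.isfinite(out).all():
            print("non-finite output in:", name)
    return hook

# Attach the hook to every submodule of the model
for name, module in model.named_modules():
    module.register_forward_hook(report_non_finite(name))

inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()
model(input_ids=inputs, decoder_input_ids=inputs)

The first name printed is where the overflow originates; later prints are just the NaNs propagating through the rest of the network.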

Expected behavior

The forward pass should produce finite (non-NaN) values.
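
For comparison, a workaround sketch (not the fix the maintainer references in the comments below): keep the weights in fp32 and let torch.cuda.amp.autocast (available from PyTorch 1.6, newer than the 1.4.0 listed below) choose the precision per operation, which is generally less prone to this kind of overflow than casting the whole model with .half():

import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda()  # weights stay in fp32

inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()
# autocast picks fp16 or fp32 per op instead of forcing fp16 everywhere
with torch.cuda.amp.autocast():
    out = model(input_ids=inputs, decoder_input_ids=inputs)
print(out[0][:, :, :10])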

Environment info

  • transformers version: 2.9.0
  • Platform: Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, May 18, 2020

Thanks for the detailed error description @binshengliu! I linked a PR that should fix it 😃

0 reactions
SamsTheGreatest commented, Jul 1, 2021

Same when fine-tuning GPT Neo.

Top Results From Across the Web

  • T5 fp16 issue is fixed - Transformers - Hugging Face Forums
  • FP16 model Inference on GPU gives all Nan values in output ...
  • How to avoid huggingface t5-based seq to seq suddenly ...
  • Release Notes :: NVIDIA Deep Learning cuDNN Documentation
  • Nan Loss with torch.cuda.amp and CrossEntropyLoss
