T5 fp16 forward yields nan
🐛 Bug
Information
Model I am using (Bert, XLNet …): T5
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I use pytorch-lightning to manage fp16. The following minimal example reproduces the issue:
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# Casting the whole model to fp16 is enough to trigger the problem.
model = T5Model.from_pretrained("t5-base").cuda().half()

text = "hello world!"
inputs = tokenizer.encode(text, return_tensors="pt").cuda()
out = model(input_ids=inputs, decoder_input_ids=inputs)
# out[0] is the decoder's last hidden state; every value comes back NaN.
print(out[0][:, :, :10])
output:
tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]], device='cuda:0',
dtype=torch.float16, grad_fn=<SliceBackward>)
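For readers debugging similar NaNs, here is a small sketch (not part of the original report) that uses PyTorch forward hooks to locate the first module whose fp16 activations become non-finite. The model and input mirror the repro above; register_nan_hooks is a hypothetical helper name, not anything from the transformers API.

import torch
from transformers import T5Model, T5Tokenizer

def register_nan_hooks(model):
    # Attach a forward hook to every submodule that reports non-finite outputs.
    # Everything downstream of the first offender will also print, so the
    # first name printed is where the problem starts.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output in: {name}")
                    break
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base").cuda().half()
register_nan_hooks(model)
inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()
model(input_ids=inputs, decoder_input_ids=inputs)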
Expected behavior
Get non-NaN values in the output.
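A common workaround, sketched here as an assumption rather than the fix referenced later in the thread: keep the weights in fp32 and run inference under torch.cuda.amp.autocast, which applies fp16 only where it is considered numerically safe. Note that autocast requires PyTorch 1.6+, newer than the 1.4.0 reported below, and whether it avoids the overflow depends on the checkpoint.

import torch
from transformers import T5Model, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
# Leave the weights in fp32 instead of calling .half() on the whole model.
model = T5Model.from_pretrained("t5-base").cuda()

inputs = tokenizer.encode("hello world!", return_tensors="pt").cuda()
with torch.cuda.amp.autocast():
    out = model(input_ids=inputs, decoder_input_ids=inputs)
print(out[0][:, :, :10])  # should print finite values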
Environment info
- transformers version: 2.9.0
- Platform: Linux-4.15.0-88-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the detailed error description @binshengliu! I linked a PR that should fix it 😃
The same happens when fine-tuning GPT-Neo.