T5-large FP16 produces nan in loss
Environment info
- `transformers` version: 4.6.0.dev0 (commit 5e04d7086803ae4a3892f4082f2835a756592c2c)
- Platform: Linux-4.15.0-1071-azure-x86_64-with-debian-buster-sid
- Python version: 3.7.3
- PyTorch version (GPU?): 1.8.1+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Who can help
t5: @patrickvonplaten, @patil-suraj
Information
Model I am using: t5-large
The problem arises when using:
- the official example scripts (details below)
The task I am working on is:
- an official task: WMT16 English-to-Romanian translation (see the command below)
To reproduce
Steps to reproduce the behavior:
```bash
cd examples/seq2seq

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../../src USE_TF=0 ./run_translation.py \
    --model_name_or_path t5-large \
    --do_train --source_lang en --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --num_train_epochs 1 --fp16
```
Expected behavior
FP16 mode should not produce NaN in the loss.
Issue Analytics
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Why do you believe this to be the case? This model was trained in bf16, which has a totally different numerical range from fp16. So it shouldn’t produce NaNs under bf16 or fp32, but under fp16 it’s almost guaranteed to not work. Please see: https://discuss.huggingface.co/t/mixed-precision-for-bfloat16-pretrained-models/5315
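A minimal illustrative sketch of that range mismatch, assuming only stock PyTorch:

```python
# fp16 tops out near 6.5e4, while bf16 covers roughly the same range
# as fp32 (~3.4e38), so bf16-scale activations overflow under fp16.
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float32).max)   # ~3.40e38

# A value that bf16 represents comfortably overflows when cast to fp16:
x = torch.tensor(1e5, dtype=torch.bfloat16)
print(x.to(torch.float16))              # inf -- later ops turn this into nan
```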
That said, please try the branch in https://github.com/huggingface/transformers/pull/10956, which attempts a workaround for AMP. Some users reported success; one user reported problems.
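The general idea behind this kind of workaround (a hedged sketch of the technique, not necessarily the exact code in the PR) is to clamp fp16 intermediate activations back into the representable range before an inf propagates and becomes nan:

```python
import torch

def clamp_fp16(hidden_states: torch.Tensor) -> torch.Tensor:
    """Squash fp16 activations back into range; no-op for other dtypes."""
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```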
And you can also try the new over/underflow detector: https://github.com/huggingface/transformers/pull/11274 if you want to get more precise info on where the problem emerges first. Just add
--debug activation_overflow
to the trainer command line and it will bail with traces of the last frames as soon as a nan or inf is encountered. I am reworking this tool to provide more info and need to revamp the interface, but it's mostly done.
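For a rough intuition of what such a detector does, here is a hypothetical sketch built on plain PyTorch forward hooks (not the tool's actual implementation) that aborts with the offending module's name as soon as a non-finite value appears:

```python
import torch
from torch import nn

def attach_overflow_hooks(model: nn.Module) -> None:
    """Raise on the first module whose output contains inf or nan."""
    def hook(module, inputs, output):
        tensors = output if isinstance(output, tuple) else (output,)
        for t in tensors:
            if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                raise RuntimeError(
                    f"inf/nan in output of {module.__class__.__name__}"
                )
    for submodule in model.modules():
        submodule.register_forward_hook(hook)
```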
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.