T5-large FP16 produces nan in loss
Environment info
- `transformers` version: 4.6.0.dev0 (commit 5e04d7086803ae4a3892f4082f2835a756592c2c)
- Platform: Linux-4.15.0-1071-azure-x86_64-with-debian-buster-sid
- Python version: 3.7.3
- PyTorch version (GPU?): 1.8.1+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Who can help
t5: @patrickvonplaten, @patil-suraj
Information
Model I am using: t5-large
The problem arises when using:
- the official example scripts (details below)
The task I am working on is:
- an official task: WMT16 English-to-Romanian translation (see the command below)
To reproduce
Steps to reproduce the behavior:
```bash
cd examples/seq2seq

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../../src USE_TF=0 ./run_translation.py \
    --model_name_or_path t5-large \
    --do_train --source_lang en --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name wmt16 --dataset_config "ro-en" \
    --output_dir /tmp/tst-translation \
    --per_device_train_batch_size 4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --num_train_epochs 1 --fp16
```
Expected behavior
FP16 mode should not produce NaN in the loss.
Issue Analytics
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Why do you believe this to be the case? This model was trained in bf16, which has a totally different numerical range from fp16. So it shouldn’t produce NaNs under bf16 or fp32, but under fp16 it’s almost guaranteed to not work. Please see: https://discuss.huggingface.co/t/mixed-precision-for-bfloat16-pretrained-models/5315
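A minimal illustrative sketch of that range mismatch, assuming only stock PyTorch:

```python
# fp16 tops out near 6.5e4, while bf16 covers roughly the same range
# as fp32 (~3.4e38), so bf16-scale activations overflow under fp16.
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float32).max)   # ~3.40e38

# A value that bf16 represents comfortably overflows when cast to fp16:
x = torch.tensor(1e5, dtype=torch.bfloat16)
print(x.to(torch.float16))              # inf -- later ops turn this into nan
```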
That said, please try the branch in https://github.com/huggingface/transformers/pull/10956, which attempts a workaround for AMP. Some users reported success; one user reported problems.
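The general idea behind this kind of workaround (a hedged sketch of the technique, not necessarily the exact code in the PR) is to clamp fp16 intermediate activations back into the representable range before an inf propagates and becomes nan:

```python
import torch

def clamp_fp16(hidden_states: torch.Tensor) -> torch.Tensor:
    """Squash fp16 activations back into range; no-op for other dtypes."""
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```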
And you can also try the new over/underflow detector: https://github.com/huggingface/transformers/pull/11274 if you want to get more precise info on where the problem emerges first. Just add
--debug activation_overflow
to the trainer command line and it will bail with traces of the last frames as soon as a nan or inf is encountered. I am reworking this tool to provide more info and need to revamp the interface, but it's mostly done.
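For a rough intuition of what such a detector does, here is a hypothetical sketch built on plain PyTorch forward hooks (not the tool's actual implementation) that aborts with the offending module's name as soon as a non-finite value appears:

```python
import torch
from torch import nn

def attach_overflow_hooks(model: nn.Module) -> None:
    """Raise on the first module whose output contains inf or nan."""
    def hook(module, inputs, output):
        tensors = output if isinstance(output, tuple) else (output,)
        for t in tensors:
            if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                raise RuntimeError(
                    f"inf/nan in output of {module.__class__.__name__}"
                )
    for submodule in model.modules():
        submodule.register_forward_hook(hook)
```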
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.