T5-v1.1 loss goes to NaN when fp16 training is enabled
Environment info
I tested in two different environments: one is my native environment, the other is the NVIDIA container pytorch_21.09. For more details, please refer to https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-09.html#rel_21-09
- transformers version: 4.11.3
- Platform: Arch Linux 5.14.14-arch1-1 (Ubuntu 20.04)
- Python version: 3.9.7 (3.8)
- PyTorch version (GPU?): 1.9.1 (1.10a)
- Tensorflow version (GPU?): 2.6.0 (did not use)
- Using GPU in script?: 2080Ti (V100)
- Using distributed or parallel set-up in script?: using fp16
Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using: t5-v1.1 (small, base). With mixed precision, the loss goes to NaN.
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
The bug can be reproduced with both run_summarization.py and run_summarization_no_trainer.py.
To reproduce
Steps to reproduce the behavior:
1. Both of the following scripts reproduce the issue (both the native amp and apex backends, e.g. --fp16_backend apex, face the same problem):
python run_summarization.py \
--fp16 \
--model_name_or_path google/t5-v1_1-base \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 \
--overwrite_output_dir

accelerate launch --fp16 run_summarization_no_trainer.py \
--model_name_or_path google/t5-v1_1-base \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--per_device_train_batch_size=2 \
--output_dir ~/tmp/tst-summarization
- If you print the loss step by step, you will find that it goes to NaN. (For the Trainer, I print the loss right before trainer.training_step returns.)
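For reference, a minimal sketch of how the per-step loss can be inspected (a hypothetical helper, not part of the original report, assuming the two-argument training_step signature of transformers 4.11):

import torch
from transformers import Seq2SeqTrainer

class LossLoggingSeq2SeqTrainer(Seq2SeqTrainer):
    def training_step(self, model, inputs):
        # training_step returns the detached scalar loss for this batch
        loss = super().training_step(model, inputs)
        if torch.isnan(loss):
            print(f"loss became NaN at global step {self.state.global_step}")
        return loss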
Possible Reason
In https://github.com/huggingface/transformers/pull/10496, the models clamp inf values only when hidden_states.dtype == torch.float16. However, even when fp16 training is enabled, hidden_states.dtype is still torch.float32, so the clamp is skipped. This might be due to the layer_norm operation.
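To illustrate, here is a paraphrased sketch of that dtype-gated clamp (not the exact library source) together with a small autocast experiment: layer_norm is on autocast's float32 list, so the hidden states that reach the check come back as float32 even though the surrounding matmuls run in float16.

import torch

def clamp_if_fp16(hidden_states: torch.Tensor) -> torch.Tensor:
    # Clamp only when the tensor is already float16 -- the condition in question.
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states

linear = torch.nn.Linear(512, 512).cuda()
with torch.cuda.amp.autocast():
    x = torch.randn(2, 8, 512, device="cuda")
    h = linear(x)                                  # matmul autocasts to float16
    h = torch.nn.functional.layer_norm(h, (512,))  # layer_norm runs in float32
    print(h.dtype)                                 # torch.float32
    print(clamp_if_fp16(h) is h)                   # True: the clamp is skipped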
Here is some more information that might be useful: when training BART and T5 with fp16, hidden_states.dtype is also torch.float32; however, their loss does not go to NaN.
@stas00 @patrickvonplaten @LysandreJik PR #10956 does prevent T5 from going to NaN and achieves results comparable to fp32. Closing this issue; let's move to PR #10956 for further discussion.
I am working with @HaokunLiu on a project that uses T5, and he found a great solution to this problem. The idea is to scale down the weights of the model in a specific pattern that maintains the relationship between the weights. I am not sure whether this transformation is loss-preserving, but logits.argmax should remain the same. Here is his script:
In __init__ (https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1461) you need to add:
Then in the forward function (https://github.com/huggingface/transformers/blob/84ea427f460ffc8d2ddc08a341ccda076c24fc1f/src/transformers/models/t5/modeling_t5.py#L1640) you need the following lines here.
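The snippets themselves are not reproduced above. As a rough, hypothetical sketch of the general idea only (the choice of layers and the factor are assumptions, not the original script): because T5's layer norm is an RMSNorm, which is insensitive to the scale of its input, dividing every weight matrix that writes into the residual stream (the embeddings, the attention output projections, and the feed-forward output projections) by a common factor keeps the hidden states small enough for fp16 while leaving the normalized activations that feed each sub-layer, and hence logits.argmax, essentially unchanged.

import torch
from transformers import T5ForConditionalGeneration

def downscale_t5(model: T5ForConditionalGeneration, factor: float = 8.0):
    # Hypothetical reconstruction of the idea, not the script from the comment.
    with torch.no_grad():
        model.shared.weight.div_(factor)  # token embeddings feed the residual stream
        for stack in (model.encoder, model.decoder):
            for block in stack.block:
                for layer in block.layer:
                    if hasattr(layer, "SelfAttention"):
                        layer.SelfAttention.o.weight.div_(factor)
                    if hasattr(layer, "EncDecAttention"):
                        layer.EncDecAttention.o.weight.div_(factor)
                    if hasattr(layer, "DenseReluDense"):
                        layer.DenseReluDense.wo.weight.div_(factor)
    return model

model = downscale_t5(T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base"))

For t5-v1.1 the lm_head is untied from the embeddings and reads the decoder output after a final RMSNorm, so the logits should stay close to the original ones; whether the transformation is exactly loss-preserving is the open question raised above.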