Pegasus pretraining in fp16 results in NaN loss
Environment info
- transformers version: 4.5.1
- Platform: Linux-5.4.0-73-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using: pegasus
The problem arises when using:
- my own modified scripts:
config = PegasusConfig(**pegasus_config_kwargs)
model = PegasusForConditionalGeneration(config=config)
and then training with Trainer with fp16 enabled.
The trainer args I’m using:
{
"logging_strategy": "steps",
"logging_steps": 20,
"save_strategy": "steps",
"save_steps": 5000,
"num_train_epochs": 2,
"lr_scheduler_type": "linear",
"warmup_steps": 10000,
"learning_rate": 0.001,
"dataloader_num_workers": 8,
"per_device_train_batch_size": 16,
"gradient_accumulation_steps": 16,
"group_by_length": true,
"adafactor": true,
"fp16": true
}
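For reference, here is a minimal sketch of how the snippet and these arguments fit together; `train_dataset` and `data_collator` are placeholders (not taken from the report), and `pegasus_config_kwargs` is whatever config dict the script builds:

```python
# Minimal sketch of the setup described above (not the author's full script).
from transformers import (
    PegasusConfig,
    PegasusForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

config = PegasusConfig(**pegasus_config_kwargs)       # randomly initialized, no pretrained weights
model = PegasusForConditionalGeneration(config=config)

training_args = TrainingArguments(
    output_dir="pegasus-pretrain",
    logging_strategy="steps",
    logging_steps=20,
    save_strategy="steps",
    save_steps=5000,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_steps=10000,
    learning_rate=0.001,
    dataloader_num_workers=8,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    group_by_length=True,
    adafactor=True,
    fp16=True,                                        # mixed precision via torch.cuda.amp
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # placeholder dataset
    data_collator=data_collator,   # placeholder collator
)
trainer.train()
```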
The task I am working on is:
- my own task or dataset
To reproduce
I was trying to pretrain Pegasus in fp16 from scratch using a modified script. Training is much faster, around a 40% speedup, but after almost 3 days, about 10% into the second epoch, the loss became NaN. Debugging to find the place where the overflow occurs is probably possible, but it will be troublesome. Do you know what the problem could be, or whether someone is already working on fp16 issues with Pegasus?
I’ve seen, for example, that it can be a problem when using pretrained checkpoints (https://discuss.huggingface.co/t/finetuning-for-fp16-compatibility/977), but shouldn’t it work when initializing the model from a config, like below?
config = PegasusConfig(**pegasus_config_kwargs)
model = PegasusForConditionalGeneration(config=config)
Training without fp16 works fine.
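On the debugging point raised above, here is a hedged sketch of one way to locate the overflow. It assumes a transformers release that ships `transformers.debug_utils` (which may be newer than the 4.5.1 reported here), and reuses the `model`/`trainer` names from the sketch above; the thread does not say which tool was actually used.

```python
# Hedged sketch, assuming a transformers release that includes debug_utils.
from transformers.debug_utils import DebugUnderflowOverflow

# Registers forward hooks on every submodule; when an inf/nan shows up in
# inputs, outputs or weights, it reports the offending module plus the
# preceding frames, which pinpoints where fp16 overflows.
debug_overflow = DebugUnderflowOverflow(model)  # keep the reference alive
trainer.train()  # or run a single forward/backward pass
```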
Top GitHub Comments
Thank you for guiding me on how to debug the model and pointing out possible fixes. It took me some time to wrap my head around fp16, but I think I now have a clear understanding of how to approach it.
For now I’ve made simple patches and will be running some more training to see how it goes. If I get some nice results, I’ll post a summary here and open a PR.
I’m glad to hear that you can now easily tell where things overflow, @kolakows.
Please remember that the original code was trained in a different dtype regime (bf16 or fp32/tf32), so the designers of the model never had to deal with fp16; that’s why changes need to be applied to the original port. The same story happens to pretty much all models of this kind (i.e. models not designed to be trained with fp16 in mind).
I trust you will be able to tweak the code to overcome this.
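To make the range difference concrete, here is a small illustration (the numbers come from PyTorch’s `torch.finfo`, not from this thread):

```python
import torch

# fp16 saturates around 6.5e4, while bf16 covers roughly the same range as
# fp32, so math that is safe in bf16/fp32 can overflow to inf in fp16.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).max)   # ~3.40e+38
```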
You can approach this in 3 ways; one of them is to locally turn off autocast for the duration of the “sensitive” few lines of code.
Then you can PR the changes, and hopefully others will enjoy the fruit of your hard labour. Thank you!
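As a concrete illustration of that suggestion, here is a hedged sketch. The helper names are hypothetical, the exact overflow site in the Pegasus code is not identified in this thread, and the second function shows an alternative clamping pattern used in some other transformers models rather than anything specific to Pegasus.

```python
import torch

def sensitive_op(hidden_states, weight):
    # Run the overflow-prone math in fp32 by disabling autocast locally,
    # then hand the result back in the incoming dtype.
    with torch.cuda.amp.autocast(enabled=False):
        out = torch.matmul(hidden_states.float(), weight.float())
    return out.to(hidden_states.dtype)

def clamp_fp16(hidden_states):
    # Pattern seen in some other transformers models: clamp activations just
    # below the fp16 maximum so downstream ops don't produce inf.
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```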