Pegasus pretraining in fp16 results in NaN loss
Environment info
- transformers version: 4.5.1
- Platform: Linux-5.4.0-73-generic-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help
@patrickvonplaten, @patil-suraj
Information
Model I am using: pegasus
The problem arises when using:
- my own modified scripts:
config = PegasusConfig(**pegasus_config_kwargs)
model = PegasusForConditionalGeneration(config=config)
and then training with Trainer with fp16 enabled.
The trainer args I’m using:
{
"logging_strategy": "steps",
"logging_steps": 20,
"save_strategy": "steps",
"save_steps": 5000,
"num_train_epochs": 2,
"lr_scheduler_type": "linear",
"warmup_steps": 10000,
"learning_rate": 0.001,
"dataloader_num_workers": 8,
"per_device_train_batch_size": 16,
"gradient_accumulation_steps": 16,
"group_by_length": true,
"adafactor": true,
"fp16": true
}
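For reference, here is a minimal sketch of how the snippet and these arguments fit together; `train_dataset` and `data_collator` are placeholders (not taken from the report), and `pegasus_config_kwargs` is whatever config dict the script builds:

```python
# Minimal sketch of the setup described above (not the author's full script).
from transformers import (
    PegasusConfig,
    PegasusForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

config = PegasusConfig(**pegasus_config_kwargs)       # randomly initialized, no pretrained weights
model = PegasusForConditionalGeneration(config=config)

training_args = TrainingArguments(
    output_dir="pegasus-pretrain",
    logging_strategy="steps",
    logging_steps=20,
    save_strategy="steps",
    save_steps=5000,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_steps=10000,
    learning_rate=0.001,
    dataloader_num_workers=8,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    group_by_length=True,
    adafactor=True,
    fp16=True,                                        # mixed precision via torch.cuda.amp
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # placeholder dataset
    data_collator=data_collator,   # placeholder collator
)
trainer.train()
```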
The task I am working on is:
- my own task or dataset
To reproduce
I was trying to pretrain Pegasus in fp16 from scratch using a modified script. Training is much faster, around a 40% speedup, but after almost 3 days, about 10% into the second epoch, the loss became NaN. Debugging to find the place where the overflow occurs is probably possible, but it will be troublesome. Do you know what the problem could be, or whether someone is already working on fp16 issues with Pegasus?
I’ve seen, for example, that it can be a problem when using pretrained checkpoints (https://discuss.huggingface.co/t/finetuning-for-fp16-compatibility/977), but shouldn’t it work when initializing the model from a config, like below?
config = PegasusConfig(**pegasus_config_kwargs)
model = PegasusForConditionalGeneration(config=config)
Training without fp16 works fine.
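On the debugging point raised above, here is a hedged sketch of one way to locate the overflow. It assumes a transformers release that ships `transformers.debug_utils` (which may be newer than the 4.5.1 reported here), and reuses the `model`/`trainer` names from the sketch above; the thread does not say which tool was actually used.

```python
# Hedged sketch, assuming a transformers release that includes debug_utils.
from transformers.debug_utils import DebugUnderflowOverflow

# Registers forward hooks on every submodule; when an inf/nan shows up in
# inputs, outputs or weights, it reports the offending module plus the
# preceding frames, which pinpoints where fp16 overflows.
debug_overflow = DebugUnderflowOverflow(model)  # keep the reference alive
trainer.train()  # or run a single forward/backward pass
```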
Top GitHub Comments
Thank you for guiding me on how to debug the model and pointing out possible fixes. It took me some time to wrap my head around fp16, but I think I now have a clear understanding of how to approach it.
For now I’ve made simple patches and will be running some more training to see how it goes. If I get some nice results, I’ll post a summary here and open a PR.
I’m glad to hear that you can now easily tell where things overflow, @kolakows.
Please remember that the original code was trained in a different dtype regime (bf16 or fp32/tf32), so the designers of the model never had to deal with fp16; that’s why changes need to be applied to the original port. The same story happens to pretty much all models of this kind (i.e. models not designed to be trained with fp16 in mind).
I trust you will be able to tweak the code to overcome this.
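To make the range difference concrete, here is a small illustration (the numbers come from PyTorch’s `torch.finfo`, not from this thread):

```python
import torch

# fp16 saturates around 6.5e4, while bf16 covers roughly the same range as
# fp32, so math that is safe in bf16/fp32 can overflow to inf in fp16.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).max)   # ~3.40e+38
```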
You can approach this in 3 ways; one of them is to locally turn off autocast for the duration of the “sensitive” few lines of code.
Then you can PR the changes, and hopefully others will enjoy the fruit of your hard labour. Thank you!
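As a concrete illustration of that suggestion, here is a hedged sketch. The helper names are hypothetical, the exact overflow site in the Pegasus code is not identified in this thread, and the second function shows an alternative clamping pattern used in some other transformers models rather than anything specific to Pegasus.

```python
import torch

def sensitive_op(hidden_states, weight):
    # Run the overflow-prone math in fp32 by disabling autocast locally,
    # then hand the result back in the incoming dtype.
    with torch.cuda.amp.autocast(enabled=False):
        out = torch.matmul(hidden_states.float(), weight.float())
    return out.to(hidden_states.dtype)

def clamp_fp16(hidden_states):
    # Pattern seen in some other transformers models: clamp activations just
    # below the fp16 maximum so downstream ops don't produce inf.
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```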