XLSR-53 crashes with NaNs after some epochs of pretraining
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
We tried pretraining from the XLSR-53 checkpoint on a dataset of newer languages totalling around 10K hours. The training loss explodes after a few epochs, and lowering the learning rate, increasing the batch size, etc. doesn't seem to help (it delays the exploding gradients by another 3-4 epochs, but training always crashes eventually).
[2021-07-22 17:10:50,251][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 17:15:02,126][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:28:47,978][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:45:07,100][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:09:10,411][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:19:29,514][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:23:28,679][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:37:12,573][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:59:19,864][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 19:07:13,025][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 19:07:22,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
[2021-07-22 19:07:31,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
[2021-07-22 19:07:44,846][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
[2021-07-22 19:07:58,494][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
[2021-07-22 19:08:03,117][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
[2021-07-22 19:08:07,374][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
[2021-07-22 19:08:16,389][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
[2021-07-22 19:08:25,450][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
[2021-07-22 19:08:30,198][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
[2021-07-22 19:08:34,115][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
[2021-07-22 19:08:38,548][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
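For context on the log above: these messages come from dynamic loss scaling in FP16 training. Every detected inf/NaN gradient skips the update and halves the loss scale, and once the scale falls below the configured minimum (fairseq's `min_loss_scale`, which I believe defaults to 1e-4) training aborts with a FloatingPointError. The snippet below is only an illustrative sketch of that logic, not fairseq's actual implementation:

```python
# Illustrative sketch of dynamic loss scaling (not fairseq's actual code).
# Each overflow skips the optimizer step and halves the scale; repeated
# overflows drive the scale toward a minimum, after which training aborts.
import math

class DynamicLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, min_scale=1e-4):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale

    def check_overflow(self, grad_norm: float) -> bool:
        """Return True (and shrink the scale) if the gradient norm is inf/NaN."""
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            self.scale /= self.scale_factor  # halve the loss scale
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    "Minimum loss scale reached; loss is probably exploding."
                )
            return True  # caller skips this optimizer step
        return False
```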
Additionally, we use temperature sampling over languages and keep alpha at 0.7.
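For reference, temperature sampling here means drawing batches from language l with probability proportional to (n_l / N)^alpha, which upsamples low-resource languages relative to their natural share of the data. A minimal sketch (function name and hour counts are made up for illustration):

```python
# Minimal sketch of temperature sampling over languages (alpha = 0.7).
# Probabilities are proportional to (n_l / N) ** alpha.
import numpy as np

def sampling_probs(hours_per_language: dict, alpha: float = 0.7) -> dict:
    langs = list(hours_per_language)
    shares = np.array([hours_per_language[l] for l in langs], dtype=np.float64)
    shares /= shares.sum()            # natural distribution n_l / N
    weights = shares ** alpha         # temperature-smoothed weights
    probs = weights / weights.sum()   # renormalize to a distribution
    return dict(zip(langs, probs))

# Example with hypothetical hour counts:
# sampling_probs({"hi": 5000, "ta": 3000, "as": 200}, alpha=0.7)
```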
Code
We use the same config as the wav2vec 2.0 large model, except that we increase max_tokens from 1200000 to 1800000.
What have you tried?
- Training with LR 5e-4 and warmup_steps of 32K:
  - The loss decreases as expected for 2 epochs; at the start of epoch 3, training crashes due to exploding gradients.
- Training with LR 5e-4, warmup_steps of 32K, and clipping the L2 norm of the gradients to 1.0:
  - Similar behavior as before.
- Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of the gradients to 1.0:
  - This runs for 10 epochs before crashing, again because the L2 norm of the gradients blows up.
- Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of the gradients to 0.5:
  - Similar behavior as 3), but runs for one more epoch.
The best checkpoints obtained from experiments 3) or 4) also crash with exploding gradients during fine-tuning.
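For anyone reproducing the clipping experiments above outside of fairseq: clipping the global L2 norm of the gradients (roughly what fairseq's clip_norm option does) can be done with torch.nn.utils.clip_grad_norm_, which also returns the pre-clip norm and is therefore handy for logging when gradients start to spike. A rough sketch, with hypothetical model/optimizer/loss objects:

```python
# Sketch: clip the global L2 norm of gradients and return the pre-clip norm,
# which is useful for spotting the point where gradients start to blow up.
import torch

def training_step(model, optimizer, loss, max_norm=1.0):
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the total
    # L2 norm measured *before* clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        # Skip the update on inf/NaN gradients instead of corrupting weights.
        return float(total_norm)
    optimizer.step()
    return float(total_norm)
```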
Any ideas on how to make the training stable? @alexeib @aconneau @myleott @michaelauli
What’s your environment?
- fairseq Version (e.g., 1.0 or master): master
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install -e .
- Python version: 3.7
- CUDA/cuDNN version: 11.0
- GPU models and configuration: 8 A100s
- Any other relevant information:
Top GitHub Comments
I was able to stabilize most models with just the changes I mentioned, but you are free to experiment, and please share the results!
@gowtham1997 The loss scale has dropped all the way to 0.0001220703125, which may mean your gradients/loss are very large. The max value of FP16 is 65504, while BF16 goes up to about 3e38.
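The FP16/BF16 dynamic ranges quoted above can be checked directly in PyTorch:

```python
# Maximum representable value of the two 16-bit float formats.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
```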