XLSR-53 crashes with NaNs after some epochs of pretraining
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
We tried pretraining from the XLSR-53 checkpoint on a dataset of newer languages totalling around 10K hours. The training loss explodes after a few epochs, and lowering the learning rate, increasing the batch size, etc. doesn't seem to help (it delays the exploding gradients by another 3-4 epochs, but training always crashes eventually).
[2021-07-22 17:10:50,251][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 17:15:02,126][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:28:47,978][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:45:07,100][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:09:10,411][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:19:29,514][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:23:28,679][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:37:12,573][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:59:19,864][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 19:07:13,025][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 19:07:22,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
[2021-07-22 19:07:31,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
[2021-07-22 19:07:44,846][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
[2021-07-22 19:07:58,494][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
[2021-07-22 19:08:03,117][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
[2021-07-22 19:08:07,374][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
[2021-07-22 19:08:16,389][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
[2021-07-22 19:08:25,450][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
[2021-07-22 19:08:30,198][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
[2021-07-22 19:08:34,115][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
[2021-07-22 19:08:38,548][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
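For context on the log above: these messages come from dynamic loss scaling in FP16 training. Every detected inf/NaN gradient skips the update and halves the loss scale, and once the scale falls below the configured minimum (fairseq's `min_loss_scale`, which I believe defaults to 1e-4) training aborts with a FloatingPointError. The snippet below is only an illustrative sketch of that logic, not fairseq's actual implementation:

```python
# Illustrative sketch of dynamic loss scaling (not fairseq's actual code).
# Each overflow skips the optimizer step and halves the scale; repeated
# overflows drive the scale toward a minimum, after which training aborts.
import math

class DynamicLossScaler:
    def __init__(self, init_scale=128.0, scale_factor=2.0, min_scale=1e-4):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.min_scale = min_scale

    def check_overflow(self, grad_norm: float) -> bool:
        """Return True (and shrink the scale) if the gradient norm is inf/NaN."""
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            self.scale /= self.scale_factor  # halve the loss scale
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    "Minimum loss scale reached; loss is probably exploding."
                )
            return True  # caller skips this optimizer step
        return False
```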
Additionally, we use temperature sampling over languages and keep alpha at 0.7.
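For reference, temperature sampling here means drawing batches from language l with probability proportional to (n_l / N)^alpha, which upsamples low-resource languages relative to their natural share of the data. A minimal sketch (function name and hour counts are made up for illustration):

```python
# Minimal sketch of temperature sampling over languages (alpha = 0.7).
# Probabilities are proportional to (n_l / N) ** alpha.
import numpy as np

def sampling_probs(hours_per_language: dict, alpha: float = 0.7) -> dict:
    langs = list(hours_per_language)
    shares = np.array([hours_per_language[l] for l in langs], dtype=np.float64)
    shares /= shares.sum()            # natural distribution n_l / N
    weights = shares ** alpha         # temperature-smoothed weights
    probs = weights / weights.sum()   # renormalize to a distribution
    return dict(zip(langs, probs))

# Example with hypothetical hour counts:
# sampling_probs({"hi": 5000, "ta": 3000, "as": 200}, alpha=0.7)
```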
Code
We use the same config as the wav2vec 2.0 large model, except that we increase max_tokens from 1200000 to 1800000.
What have you tried?
- Training with LR 5e-4 and warmup_steps of 32K:
  - The loss decreases as expected for 2 epochs; at the start of epoch 3, training crashes due to exploding gradients.
- Training with LR 5e-4, warmup_steps of 32K, and clipping the L2 norm of the gradients to 1.0:
  - Similar behavior as before.
- Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of the gradients to 1.0:
  - This runs for 10 epochs before crashing, again because the L2 norm of the gradients blows up.
- Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of the gradients to 0.5:
  - Similar behavior as 3), but runs for one more epoch.
The best checkpoints obtained from experiments 3) or 4) also crash with exploding gradients during fine-tuning.
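For anyone reproducing the clipping experiments above outside of fairseq: clipping the global L2 norm of the gradients (roughly what fairseq's clip_norm option does) can be done with torch.nn.utils.clip_grad_norm_, which also returns the pre-clip norm and is therefore handy for logging when gradients start to spike. A rough sketch, with hypothetical model/optimizer/loss objects:

```python
# Sketch: clip the global L2 norm of gradients and return the pre-clip norm,
# which is useful for spotting the point where gradients start to blow up.
import torch

def training_step(model, optimizer, loss, max_norm=1.0):
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the total
    # L2 norm measured *before* clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(total_norm):
        # Skip the update on inf/NaN gradients instead of corrupting weights.
        return float(total_norm)
    optimizer.step()
    return float(total_norm)
```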
Any ideas on how to make the training stable? @alexeib @aconneau @myleott @michaelauli
What’s your environment?
- fairseq Version (e.g., 1.0 or master): master
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install -e .
- Python version: 3.7
- CUDA/cuDNN version: 11.0
- GPU models and configuration: 8 A100s
- Any other relevant information:
Top GitHub Comments
I was able to stabilize most models with just the changes I mentioned, but you are free to experiment, and please share the results!
@gowtham1997 The loss scale has dropped all the way to 0.0001220703125, which may mean your gradients/loss are very large. The max value of FP16 is 65504, while BF16 goes up to about 3e38.
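The FP16/BF16 dynamic ranges quoted above can be checked directly in PyTorch:

```python
# Maximum representable value of the two 16-bit float formats.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38
```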