Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

XLSR-53 crashes with NaNs after some epochs of pretraining

See original GitHub issue

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

We tried continuing pretraining from the XLSR-53 checkpoint on a roughly 10K-hour dataset covering some newer languages. The training loss seems to explode after a few epochs, and lowering the learning rate, increasing the batch size, etc. doesn't seem to help (it delays the exploding gradients by another 3-4 epochs, but the training always ends up crashing).

[2021-07-22 17:10:50,251][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 17:15:02,126][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:28:47,978][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 17:45:07,100][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:09:10,411][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:19:29,514][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 18:23:28,679][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:37:12,573][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 18:59:19,864][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
[2021-07-22 19:07:13,025][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2021-07-22 19:07:22,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
[2021-07-22 19:07:31,087][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
[2021-07-22 19:07:44,846][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
[2021-07-22 19:07:58,494][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
[2021-07-22 19:08:03,117][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
[2021-07-22 19:08:07,374][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
[2021-07-22 19:08:16,389][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
[2021-07-22 19:08:25,450][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
[2021-07-22 19:08:30,198][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
[2021-07-22 19:08:34,115][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
[2021-07-22 19:08:38,548][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
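
For context on what these lines mean: fairseq's FP16 trainer uses dynamic loss scaling, so each detected overflow skips that update and halves the loss scale, and training aborts once the scale falls below the configured minimum (min_loss_scale, which we believe defaults to 1e-4); the last line above is one halving away from that floor. A rough sketch of the mechanism, not fairseq's actual code and with illustrative constants:

# Rough sketch of the dynamic loss-scaling behavior shown in the log above.
# NOT fairseq's implementation; init_scale and min_scale are illustrative values.
class DynamicLossScaler:
    def __init__(self, init_scale=128.0, min_scale=1e-4):
        self.scale = init_scale
        self.min_scale = min_scale

    def step(self, grads_have_overflow):
        if grads_have_overflow:
            # Skip this update and halve the scale, as in the log lines above.
            self.scale /= 2.0
            if self.scale < self.min_scale:
                raise FloatingPointError(
                    "loss scale below minimum: gradients keep overflowing (likely inf/NaN)"
                )
            return False   # optimizer step skipped
        return True        # safe to apply the optimizer step

scaler = DynamicLossScaler()
try:
    for _ in range(30):                       # a burst of overflows, like 19:07-19:08 above
        scaler.step(grads_have_overflow=True)
except FloatingPointError as err:
    print(err, "| final scale:", scaler.scale)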

Additionally, we use temperature sampling across the languages and keep alpha at 0.7.
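
By temperature sampling we mean drawing training data across languages with probability proportional to n_l^alpha, as in the XLSR setup, so low-resource languages get upsampled. A minimal sketch of how the sampling weights are computed; the language names and hour counts below are hypothetical:

# Sketch of temperature-based sampling over languages with alpha = 0.7.
# The hours below are hypothetical; the real sizes come from our data manifests.
import numpy as np

hours_per_language = {"lang_a": 4000.0, "lang_b": 800.0, "lang_c": 150.0}

alpha = 0.7
sizes = np.array(list(hours_per_language.values()))
probs = sizes / sizes.sum()        # natural distribution over languages
probs = probs ** alpha             # alpha < 1 upsamples low-resource languages
probs = probs / probs.sum()        # renormalize into sampling probabilities

for lang, p in zip(hours_per_language, probs):
    print(f"{lang}: {p:.3f}")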

Code

We use the same config as the wav2vec2 large model.

^ We increase the max_tokens from 1200000 to 1800000.
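
In other words, we take the public wav2vec 2.0 large pretraining YAML and override a single field. A sketch of that override (the config path and output filename are assumptions; everything else stays at the stock values):

# Sketch of our config change: load the stock wav2vec 2.0 "large" pretraining YAML
# and override max_tokens. The YAML path is an assumption about the repo layout;
# adjust it to wherever the config lives in your fairseq checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/wav2vec/config/pretraining/wav2vec2_large_librivox.yaml")
print("stock max_tokens:", cfg.dataset.max_tokens)       # 1200000 in the config we start from
cfg.dataset.max_tokens = 1800000                         # our only change on top of the recipe
OmegaConf.save(cfg, "xlsr53_continue_pretraining.yaml")  # hypothetical output filename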

What have you tried?

  1. Training with LR 5e-4 and warmup_steps of 32K: [image]

^ The loss decreases as expected for 2 epochs; at the start of epoch 3, the training crashes due to exploding gradients.

  2. Training with LR 5e-4, warmup_steps of 32K, and clipping the L2 norm of gradients to 1.0: [image]

^ Similar behavior as before.

  3. Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of gradients to 1.0: [image]

^ This runs for 10 epochs before crashing due to the higher L2 norm of the gradients. [image]

  4. Training with LR 5e-5, no warmup_steps, and clipping the L2 norm of gradients to 0.5

^ Similar behavior as 3), but runs for one more epoch.

The best checkpoints obtained from experiments 3) or 4) also crash with exploding gradients during fine-tuning.
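
For anyone reproducing experiments 2-4: the clipping referred to above is global L2-norm gradient clipping, which fairseq exposes as the clip_norm optimization option. A plain-PyTorch sketch of the equivalent update step:

# Plain-PyTorch sketch of the L2-norm gradient clipping used in experiments 2-4.
# model, optimizer and loss are placeholders; fairseq handles this step internally
# in our actual runs.
import torch

def train_step(model, optimizer, loss, max_norm=1.0):   # we tried max_norm = 1.0 and 0.5
    optimizer.zero_grad()
    loss.backward()
    # Rescale all gradients so that their global L2 norm is at most max_norm.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return grad_norm   # worth logging: the norm grows steadily before each crash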

Any ideas on how to make the training stable? @alexeib @aconneau @myleott @michaelauli

What’s your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install -e .
  • Python version: 3.7
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: 8 A100s
  • Any other relevant information:

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (2 by maintainers)

Top GitHub Comments

2 reactions
alexeib commented, Aug 9, 2021

I was able to stabilize most models with just the changes I mentioned, but you are free to experiment, and please share the results!

1 reaction
codecaution commented, Aug 6, 2021

@gowtham1997 The loss scale is already down to 0.0001220703125, which may mean your gradients/loss are very large. The max value of FP16 is 65504, while BF16's is about 3e38.
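
For reference, the dynamic-range gap is easy to verify directly (with a reasonably recent PyTorch):

# Quick check of the FP16 vs. BF16 dynamic range mentioned above.
import torch

print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e+38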

Read more comments on GitHub >

Top Results From Across the Web

  • XLSR-53 crashes with NaNs after some epochs of pretraining: We tried pretraining the XLSR-53 checkpoint on some newer languages dataset of around 10K hours. The training loss seems to explode after few...
  • NaN loss when training regression network - Stack Overflow: I was running into my loss function suddenly returning a nan after it go so far into the training process. I checked the...
  • XLSR Wav2Vec2 Fine-Tuning Week - HackMD: The job fails during the pre-processing: A possible reason for this is that your local/ephemeral storage is exceeding the limit. This happens due...
  • Proceedings of the 13th Language Resources and Evaluation ...: a large pretrained language model. Expectedly all languages but English have very bad results (cf. first line of Table 2).
  • Conference Handbook (PDF) - naacl 2022: For the first time, NAACL-HLT 2022 is a hybrid conference. After two years of exclusively virtual conferences due to the COVID-19 pandemic, ...
