Conformer training hits vanishing/exploding gradient problems (forward Inf, backward NaN)
Conformer pretraining on the LibriSpeech 960h unlabeled data runs into Inf values in the forward pass and NaN gradients in the backward pass.
config/pretraining/wav2vec2_conformer_base_librispeech.yaml
```
[fairseq.trainer][INFO] - begin training epoch 35
[2022-02-27 21:23:49,458][fairseq_cli.train][INFO] - Start iterating over samples
[2022-02-27 21:24:55,260][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
[2022-02-27 21:34:07,578][train_inner][INFO] - {"epoch": 35, "update": 34.289, "loss": "3.005", "ntokens": "181698", "nsentences": "655.207", "prob_perplexity": "200.664", "code_perplexity": "197.743", "temp": "1.76", "loss_0": "2.894", "loss_1": "0.099", "loss_2": "0.013", "accuracy": "0.46115", "wps": "36533.7", "ups": "0.2", "wpb": "181698", "bsz": "655.2", "num_updates": "25600", "lr": "0.0004", "gnorm": "0.173", "loss_scale": "0.25", "train_wall": "604", "gb_free": "14.5", "wall": "629"}
2022-02-27 21:40:14 | WARNING | fairseq.nan_detector | Inf detected in output of , shape: torch.Size([101, 16, 271]), forward
2022-02-27 21:40:17 | WARNING | fairseq.nan_detector | NaN detected in output of , shape: torch.Size([101, 16, 226]), backward
2022-02-27 21:40:17 | WARNING | fairseq.nan_detector | NaN detected in output of , shape: torch.Size([101, 8, 322]), backward
2022-02-27 21:40:18 | WARNING | fairseq.nan_detector | NaN detected in output of , shape: torch.Size([101, 8, 351]), backward

FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding.
Try lowering the learning rate, using gradient clipping or increasing the batch size.
```
Following these suggestions (lower learning rate, gradient clipping, larger batch size) did not solve the problem.
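For concreteness, here is a minimal sketch of how those suggestions could be expressed as overrides on top of the config above. It assumes the config follows the standard fairseq Hydra dataclass layout (`optimization.lr`, `optimization.clip_norm`, `optimization.update_freq`); the values are illustrative only, not tuned for this run.

```yaml
# Illustrative overrides for wav2vec2_conformer_base_librispeech.yaml
# (key names follow fairseq's standard optimization dataclass; values are examples).
optimization:
  lr: [0.0003]        # lower the peak learning rate
  clip_norm: 25.0     # enable gradient-norm clipping (0 disables it)
  update_freq: [2]    # accumulate gradients to double the effective batch size
```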
- fairseq Version (e.g., 1.0 or main): 1.0.0a0+40ff55a
- PyTorch Version (e.g., 1.0): 1.10.1
- OS (e.g., Linux): centos
- How you installed fairseq (pip, source): pip install -e .
- Build command you used (if compiling from source):
- Python version: 3.7.0
- CUDA/cuDNN version: 10.2
Top GitHub Comments
Hi @donstang, turning off fp16 is another alternative but that might reduce the training speed a lot.
What attention mechanism are you using? Could you also share the hyperparameter settings corresponding to this run?
No.
I have tried many approaches but still can't solve it. For now, training only works with fp16 turned off.
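For reference, a sketch of the two fp16-related workarounds mentioned in this thread, written as config-style overrides. The field names (`common.fp16`, `common.fp16_init_scale`, `common.fp16_scale_window`, `common.fp16_scale_tolerance`, `common.min_loss_scale`) come from fairseq's CommonConfig; the numeric values are examples only and are not verified to fix this particular instability.

```yaml
# Option 1: disable mixed precision entirely (avoids fp16 overflow, but slower).
common:
  fp16: false
---
# Option 2: keep fp16 but make loss scaling more tolerant of occasional overflows.
common:
  fp16: true
  fp16_init_scale: 128        # starting loss scale
  fp16_scale_window: 256      # updates without overflow before the scale is raised
  fp16_scale_tolerance: 0.25  # fraction of overflowing updates tolerated before lowering the scale
  min_loss_scale: 0.0001      # training aborts if the loss scale falls below this
```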