Bug: Dynamic Convolution Attention fails in `mixed_precision` training.
Describe the bug: Dynamic Convolution Attention fails in `mixed_precision` training and ultimately causes a NaN error.
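For context, a minimal sketch of the kind of numeric failure half precision can introduce. This assumes the NaNs come from fp16 overflow in the attention energies, which the issue itself does not confirm; the values below are purely illustrative:

```python
import torch

# fp16 saturates at ~65504, so a large attention energy overflows to inf.
energy = torch.tensor([70000.0, 1.0]).to(torch.float16)
print(energy)  # the first element has overflowed to inf

# inf - inf (as happens inside a max-subtracted softmax) yields NaN,
# which then propagates into the attention weights and the loss.
print(float(energy[0]) - float(energy[0]))  # nan
```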
To Reproduce
Steps to reproduce the behavior:
- Set `mixed_precision=True` in `config.json`.
- Set `dynamic_convolution=True` in `config.json` (see the config sketch after this list).
- Start training a Tacotron or Tacotron2 model.
- On TensorBoard you initially observe broken attention alignment.
- Ultimately the loss becomes NaN.
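A minimal `config.json` fragment for the two settings named in the steps above; the key names are copied verbatim from the steps, the rest of the real config is omitted, so treat this as a sketch rather than a complete, known-good configuration:

```json
{
  "mixed_precision": true,
  "dynamic_convolution": true
}
```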
Expected behavior: The model should learn the alignment after 10K iterations with no NaN loss, as it does in full-precision training.
Environment (please complete the following information):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
- PyTorch or TensorFlow version: PyTorch 1.8.0
- Python version: 3.8
- CUDA/cuDNN version: 11.2
- GPU model and memory: 1080Ti
- Exact command to reproduce:
Top GitHub Comments
Using the APEX backend with the new API seemingly helps (see the sketch below).
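For reference, a minimal sketch of what training with the APEX `amp` API looks like in plain PyTorch. This is not the repository's actual training loop; the tiny model and data are placeholders, and it assumes NVIDIA apex is installed and a CUDA device is available:

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA apex, installed separately

# Tiny stand-in model; in the real case this would be the Tacotron/Tacotron2 model.
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# opt_level="O1" patches common ops to run in fp16 while keeping fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):
    x = torch.randn(16, 80, device="cuda")
    optimizer.zero_grad()
    loss = model(x).pow(2).mean()
    # APEX scales the loss before backward to reduce fp16 gradient underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```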
I am not sure if it helps, but I was getting the same error. I fixed the issue by setting `r = 6` via `"gradual_training": [[0, 6, 64], [15000, 4, 64], [30000, 2, 32]]` and setting `ddc_r = 6`. Removing all odd values of `r` and `ddc_r` helped my case, but alignments were still on and off for most of the training (see the config sketch below).
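A sketch of the config changes described in that comment, keeping `r` and `ddc_r` even. The key names come from the comment itself, each `gradual_training` entry appears to be `[start step, r, batch size]`, and the surrounding structure is illustrative only:

```json
{
  "r": 6,
  "ddc_r": 6,
  "gradual_training": [[0, 6, 64], [15000, 4, 64], [30000, 2, 32]]
}
```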