APEX Gradient overflow
When I use opt level O1 to train the Swin network with Apex, the training log shows:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 [2021-04-25 05:35:19 swin_base_patch4_window7_224](main_prune.py 310): INFO Train: [0/300][4050/5004] eta 0:17:06 lr 0.000500 time 1.1737 (1.0765) loss 3.2572 (3.3279) grad_norm 1.0323 (nan) mem 4814MB
Is this normal?
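For context, a typical Apex O1 training loop looks roughly like the sketch below (a toy model stands in for the Swin Transformer; the actual training script is not shown in the issue):

```python
# Minimal sketch of an Apex O1 mixed-precision loop (assumed setup, not the
# actual Swin training code). A small linear model stands in for the network.
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(128, 10).cuda()          # placeholder for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# opt_level="O1" casts whitelisted ops to fp16 and enables dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    # scale_loss multiplies the loss by the current loss scale before backward;
    # if the scaled gradients overflow, Apex skips the parameter update and
    # lowers the scale, printing "Gradient overflow. Skipping step ...".
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```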
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top Results From Across the Web
How to handle gradient overflow when training a deep model ...
Hi, it is ok to train the model with fp32, but we would like to take advantage of the speed of mixed precision....
Read more >
Apex Loss Scale not stopping - PyTorch Forums
If the gradients overflow due to loss scaling, the scaling value will be lowered and you will see this message.
Read more >
gradient overflow - Apex usage tutorial and the gradient-explosion problem
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0.
Read more >
nvidia apex Gradient overflow. Skipping step, loss scaler 0 ...
nvidia apex Gradient overflow. ... covers mixed-precision computation and introduces Apex, an NVIDIA-developed PyTorch-based mixed-precision training accelerator.
Read more >
docs_aicloud/torch_amp_example at master
Skipping step, loss scaler 0 reducing loss scale to 0.5 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 Torch...
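Read more >
That last result covers native PyTorch AMP rather than Apex. As a point of comparison, a minimal torch.cuda.amp loop (sketch below with a toy model, not taken from that repo) handles overflow the same way: GradScaler skips the step and shrinks the scale, though it does so silently rather than logging a message.

```python
# Sketch of the equivalent behaviour with native torch.cuda.amp (toy model;
# not taken from the torch_amp_example repository above).
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()       # dynamic loss scaling, like Apex O1

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # run the forward pass in fp16
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the update if the gradients contain inf/nan
    scaler.update()          # lowers the scale after an overflow, raises it later
```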
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, there is no need to worry; it's normal, since Apex uses a dynamic loss scale. You can refer to the docs for more detail.

Hi, please refer to this issue #29 for the training log.
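To make "dynamic loss scale" concrete, the simplified sketch below shows the update rule a dynamic loss scaler typically follows (illustrative only; Apex's actual implementation differs in detail): on overflow the step is skipped and the scale is halved, and after a run of overflow-free steps the scale is raised again, which is why the "reducing loss scale" messages appear occasionally without harming training.

```python
# Simplified illustration of a dynamic loss scaler (not Apex's actual code).
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # Skip the step and halve the scale (e.g. 262144 -> 131072,
            # matching the "reducing loss scale to 131072.0" message).
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            # After many overflow-free steps, try a larger scale again.
            self.scale *= 2.0
        return True
```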