APEX Gradient overflow
When I use opt level O1 to train the Swin network with Apex, the training log shows:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0 [2021-04-25 05:35:19 swin_base_patch4_window7_224](main_prune.py 310): INFO Train: [0/300][4050/5004] eta 0:17:06 lr 0.000500 time 1.1737 (1.0765) loss 3.2572 (3.3279) grad_norm 1.0323 (nan) mem 4814MB
Is this normal?
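For context, a typical Apex O1 training loop looks roughly like the sketch below (a toy model stands in for the Swin Transformer; the actual training script is not shown in the issue):

```python
# Minimal sketch of an Apex O1 mixed-precision loop (assumed setup, not the
# actual Swin training code). A small linear model stands in for the network.
import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(128, 10).cuda()          # placeholder for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# opt_level="O1" casts whitelisted ops to fp16 and enables dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    # scale_loss multiplies the loss by the current loss scale before backward;
    # if the scaled gradients overflow, Apex skips the parameter update and
    # lowers the scale, printing "Gradient overflow. Skipping step ...".
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```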
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top Results From Across the Web
How to handle gradient overflow when training a deep model ...
Hi, it is ok to train the model with fp32, but we would like to take advantage of the speed of mixed precision....
Read more >
Apex Loss Scale not stopping - PyTorch Forums
If the gradients overflow due to loss scaling, the scaling value will be lowered and you will see this message.
Read more >
gradient overflow - Apex usage tutorial and the gradient-explosion problem
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0.
Read more >
nvidia apex Gradient overflow. Skipping step, loss scaler 0 ...
nvidia apex Gradient overflow. ... covers mixed-precision computation and introduces Apex, an NVIDIA-developed PyTorch-based mixed-precision training accelerator.
Read more >
docs_aicloud/torch_amp_example at master
Skipping step, loss scaler 0 reducing loss scale to 0.5 Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25 Torch...
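Read more >
That last result covers native PyTorch AMP rather than Apex. As a point of comparison, a minimal torch.cuda.amp loop (sketch below with a toy model, not taken from that repo) handles overflow the same way: GradScaler skips the step and shrinks the scale, though it does so silently rather than logging a message.

```python
# Sketch of the equivalent behaviour with native torch.cuda.amp (toy model;
# not taken from the torch_amp_example repository above).
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()       # dynamic loss scaling, like Apex O1

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # run the forward pass in fp16
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the update if the gradients contain inf/nan
    scaler.update()          # lowers the scale after an overflow, raises it later
```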
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, there is no need to worry; it's normal, since Apex uses a dynamic loss scale. You can refer to the docs for more detail.

Hi, please refer to this issue #29 for the training log.
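To make "dynamic loss scale" concrete, the simplified sketch below shows the update rule a dynamic loss scaler typically follows (illustrative only; Apex's actual implementation differs in detail): on overflow the step is skipped and the scale is halved, and after a run of overflow-free steps the scale is raised again, which is why the "reducing loss scale" messages appear occasionally without harming training.

```python
# Simplified illustration of a dynamic loss scaler (not Apex's actual code).
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # Skip the step and halve the scale (e.g. 262144 -> 131072,
            # matching the "reducing loss scale to 131072.0" message).
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            # After many overflow-free steps, try a larger scale again.
            self.scale *= 2.0
        return True
```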