RuntimeError: hit nan for variance_normalized
Calling Ranger21 with mostly default parameters:

import ranger21

optimizer = ranger21.Ranger21(
    net.parameters(), lr=0.001, num_epochs=50, weight_decay=1e-5,
    num_batches_per_epoch=len(train_loader)
)
Training seems fine for half a day with decent progress on all loss metrics, but then halts:
File "./train_pt.py", line 727, in <module>
main(sys.argv[1:])
File "./train_pt.py", line 612, in main
optimizer.step()
File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/morbo/git/Ranger21/ranger21/ranger21.py", line 714, in step
raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized
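For context, the error comes from a defensive NaN check inside Ranger21's step(). The sketch below is not the library's actual code; it is a generic Adam-style second-moment update with a guard of the same shape, and the way variance_normalized is computed here is an assumption:

import torch

# Illustrative only -- not the actual Ranger21 source. An Adam-style optimizer
# keeps an exponential moving average of squared gradients; once a gradient
# contains NaN/inf, that average turns NaN and a guard like the one in the
# traceback raises instead of silently corrupting the weights.
def second_moment_update(grad, variance_ma, step, beta2=0.999, eps=1e-8):
    variance_ma.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_correction2 = 1 - beta2 ** step
    variance_normalized = (variance_ma / bias_correction2).sqrt_()
    if torch.isnan(variance_normalized).any():
        raise RuntimeError("hit nan for variance_normalized")
    return variance_normalized.add_(eps)

In other words, by the time this check fires, the NaN has usually already entered through the loss or the gradients, which matches the comments further down.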
Top Results From Across the Web

Function 'MulBackward0' returned nan values in its 0th output ...
Hello, I am facing the same RuntimeError. The autograd anomaly detection shows that I perform an inplace operation in variable Z. def update_clusters(self, ...

Ranger deep learning optimizer rewrite to use newest ...
Currently Ranger21's variance_normalized occasionally acquires NaNs and faults ... line 680, in step raise RuntimeError("hit nan for variance_normalized").
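The first result above mentions PyTorch's autograd anomaly detection, which is the quickest way to find the operation whose gradient first turns NaN. A minimal sketch, with a placeholder model, loss, and batch standing in for the real training code:

import torch

# Anomaly detection makes backward() raise with a stack trace pointing at the
# forward-pass op whose gradient first became NaN. It slows training, so only
# enable it while debugging.
torch.autograd.set_detect_anomaly(True)

net = torch.nn.Linear(10, 1)                        # placeholder model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)        # placeholder batch
optimizer.zero_grad()
loss = criterion(net(x), y)
loss.backward()    # raises here, naming the offending op, if a NaN appears
optimizer.step()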
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Reducing my learning rate solved it.
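A minimal sketch of what that might look like with the constructor call from the issue; the specific values are illustrative, not tuned recommendations (eps is among the parameters tried in the comment below):

import torch
import ranger21

net = torch.nn.Linear(10, 1)                                      # placeholder model
train_loader = [(torch.randn(4, 10), torch.randn(4, 1))] * 100    # placeholder loader

optimizer = ranger21.Ranger21(
    net.parameters(),
    lr=1e-4,                  # reduced from the issue's 1e-3
    eps=1e-7,                 # larger denominator floor; illustrative value
    num_epochs=50,
    weight_decay=1e-5,
    num_batches_per_epoch=len(train_loader),
)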
I integrated ranger21 into https://github.com/glinscott/nnue-pytorch and am exploring different parameters. I'm hitting this issue consistently after the first step of training.
This is what I'm using:
Changing lr, eps, weight_decay, use_adaptive_gradient_clipping, or use_warmup appears to have no effect. The NaN comes from the forward pass in the second step, so some weights must already have become NaN. The Adam and AdaBelief cores work fine.
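One way to confirm that observation is to scan the parameters for NaN/inf right after each optimizer.step(), before the next forward pass. A hedged sketch; the helper name and the surrounding loop are assumptions, not part of the issue's code:

import torch

def first_bad_parameter(model):
    """Return the name of the first parameter containing NaN/inf, or None."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            return name
    return None

# In the training loop, right after optimizer.step():
#     bad = first_bad_parameter(net)
#     if bad is not None:
#         raise RuntimeError(f"parameter '{bad}' became non-finite after step()")

If the check trips on the very first update, the problem is in the optimizer step itself; if the parameters are still finite there, the NaN is being produced later in the forward or backward pass.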