RuntimeError: hit nan for variance_normalized
Calling Ranger21 with mostly default parameters:

import ranger21

optimizer = ranger21.Ranger21(
    net.parameters(), lr=0.001, num_epochs=50, weight_decay=1e-5,
    num_batches_per_epoch=len(train_loader)
)
Training seems fine for half a day with decent progress on all loss metrics, but then halts:
File "./train_pt.py", line 727, in <module>
main(sys.argv[1:])
File "./train_pt.py", line 612, in main
optimizer.step()
File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/morbo/git/sjeng/train/venv19/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/morbo/git/Ranger21/ranger21/ranger21.py", line 714, in step
raise RuntimeError("hit nan for variance_normalized")
RuntimeError: hit nan for variance_normalized
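For context, the error comes from a defensive NaN check inside Ranger21's step(). The sketch below is not the library's actual code; it is a generic Adam-style second-moment update with a guard of the same shape, and the way variance_normalized is computed here is an assumption:

import torch

# Illustrative only -- not the actual Ranger21 source. An Adam-style optimizer
# keeps an exponential moving average of squared gradients; once a gradient
# contains NaN/inf, that average turns NaN and a guard like the one in the
# traceback raises instead of silently corrupting the weights.
def second_moment_update(grad, variance_ma, step, beta2=0.999, eps=1e-8):
    variance_ma.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_correction2 = 1 - beta2 ** step
    variance_normalized = (variance_ma / bias_correction2).sqrt_()
    if torch.isnan(variance_normalized).any():
        raise RuntimeError("hit nan for variance_normalized")
    return variance_normalized.add_(eps)

In other words, by the time this check fires, the NaN has usually already entered through the loss or the gradients, which matches the comments further down.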
Top Results From Across the Web

Function 'MulBackward0' returned nan values in its 0th output ...
Hello, I am facing the same RuntimeError. The autograd anomaly detection shows that I perform an inplace operation in variable Z. def update_clusters(self, ...

Ranger deep learning optimizer rewrite to use newest ...
Currently Ranger21's variance_normalized occasionally acquires NaNs and faults ... line 680, in step raise RuntimeError("hit nan for variance_normalized").
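The first result above mentions PyTorch's autograd anomaly detection, which is the quickest way to find the operation whose gradient first turns NaN. A minimal sketch, with a placeholder model, loss, and batch standing in for the real training code:

import torch

# Anomaly detection makes backward() raise with a stack trace pointing at the
# forward-pass op whose gradient first became NaN. It slows training, so only
# enable it while debugging.
torch.autograd.set_detect_anomaly(True)

net = torch.nn.Linear(10, 1)                        # placeholder model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)        # placeholder batch
optimizer.zero_grad()
loss = criterion(net(x), y)
loss.backward()    # raises here, naming the offending op, if a NaN appears
optimizer.step()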
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Reducing my learning rate solved it.
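A minimal sketch of what that might look like with the constructor call from the issue; the specific values are illustrative, not tuned recommendations (eps is among the parameters tried in the comment below):

import torch
import ranger21

net = torch.nn.Linear(10, 1)                                      # placeholder model
train_loader = [(torch.randn(4, 10), torch.randn(4, 1))] * 100    # placeholder loader

optimizer = ranger21.Ranger21(
    net.parameters(),
    lr=1e-4,                  # reduced from the issue's 1e-3
    eps=1e-7,                 # larger denominator floor; illustrative value
    num_epochs=50,
    weight_decay=1e-5,
    num_batches_per_epoch=len(train_loader),
)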
I integrated ranger21 into https://github.com/glinscott/nnue-pytorch and am exploring different parameters. I'm hitting this issue consistently after the first step of training.
This is what I'm using:
Changing lr, eps, weight_decay, use_adaptive_gradient_clipping, or use_warmup appears to have no effect. The NaN comes from the forward pass in the second step, so some weights must already have become NaN. The Adam and AdaBelief cores work fine.
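One way to confirm that observation is to scan the parameters for NaN/inf right after each optimizer.step(), before the next forward pass. A hedged sketch; the helper name and the surrounding loop are assumptions, not part of the issue's code:

import torch

def first_bad_parameter(model):
    """Return the name of the first parameter containing NaN/inf, or None."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            return name
    return None

# In the training loop, right after optimizer.step():
#     bad = first_bad_parameter(net)
#     if bad is not None:
#         raise RuntimeError(f"parameter '{bad}' became non-finite after step()")

If the check trips on the very first update, the problem is in the optimizer step itself; if the parameters are still finite there, the NaN is being produced later in the forward or backward pass.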