Why is rescale_grad 1.0 / args.ctx_num?
First of all, thank you for sharing the code!
Every time I train a model with insightface, it prints "UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125)", which caught my attention.
In MXNet, the standard SGD update is applied as:
rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - state
http://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.optimizer.SGD
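For concreteness, here is a minimal NumPy sketch of the quoted update rule, showing where rescale_grad enters; the function name and default hyperparameters are illustrative and not taken from insightface:

```python
import numpy as np

def sgd_momentum_step(weight, grad, state, lr=0.1, wd=5e-4,
                      momentum=0.9, rescale_grad=1.0, clip_gradient=None):
    # grad is the raw gradient from the backward pass;
    # rescale_grad is the only place the optimizer normalizes it.
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    rescaled_grad = lr * rescale_grad * grad + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state
```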
I think rescale_grad = 1.0/batch_size/num_workers is more natural, because the gradient is then averaged over the batch and is closer to the true full-batch gradient. I don't understand why 1.0 / args.ctx_num is used instead.
Any help would be appreciated; thanks in advance.
In softmax, normalization='valid' has already rescaled the gradient to 1.0/batch_size.

@nttstar Thanks for your instant reply. Maybe there is a misunderstanding: I know your code sets rescale_grad = 1.0 / args.ctx_num. However, the official MXNet documentation for optimizer.SGD recommends rescale_grad = 1.0/batch_size, so perhaps 1.0/batch_size/args.ctx_num would be reasonable. Why did you use 1.0/args.ctx_num here instead?
Thanks!
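For later readers, a hedged sketch of the two pieces being discussed (the device count, batch size, and symbol names are illustrative, not the actual insightface training code): with normalization='valid', SoftmaxOutput already divides the gradient by the number of valid samples in the per-device batch, so the optimizer only needs the extra 1.0/ctx_num factor to average across devices, and the combined scale comes out to roughly 1/(per-device batch size * ctx_num) = 1/global batch size.

```python
import mxnet as mx

ctx_num = 4             # number of GPUs (illustrative)
per_device_batch = 128  # samples per device (illustrative)

data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')
fc = mx.sym.FullyConnected(data=data, num_hidden=10, name='fc')
# 'valid' normalization averages the loss gradient over the valid
# samples in each device's batch.
softmax = mx.sym.SoftmaxOutput(data=fc, label=label,
                               normalization='valid', name='softmax')

# The optimizer then only has to average across devices.
opt = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=5e-4,
                       rescale_grad=1.0 / ctx_num)
```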