Why is rescale_grad 1.0 / args.ctx_num?
First of all, thank you for sharing the code!
Every time I train a model with insightface, it prints "UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.25 vs. 0.001953125)", which caught my attention.
In MXNet, the standard SGD update is applied as:
rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + rescaled_grad
weight = weight - state
http://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.optimizer.SGD
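For concreteness, here is a minimal NumPy sketch of the quoted update rule, showing where rescale_grad enters; the function name and default hyperparameters are illustrative and not taken from insightface:

```python
import numpy as np

def sgd_momentum_step(weight, grad, state, lr=0.1, wd=5e-4,
                      momentum=0.9, rescale_grad=1.0, clip_gradient=None):
    # grad is the raw gradient from the backward pass;
    # rescale_grad is the only place the optimizer normalizes it.
    if clip_gradient is not None:
        grad = np.clip(grad, -clip_gradient, clip_gradient)
    rescaled_grad = lr * rescale_grad * grad + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - state
    return weight, state
```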
I think rescale_grad = 1.0/batch_size/num_workers is more natural, because the gradient is then averaged over the batch and is closer to the true full-batch gradient. I don't understand why 1.0 / args.ctx_num is used instead.
Any help would be appreciated; thanks in advance.
In softmax, normalization='valid' has already rescaled the gradient to 1.0/batch_size.

@nttstar Thanks for your instant reply. Maybe there is a misunderstanding: I know your code sets rescale_grad = 1.0 / args.ctx_num. However, the official MXNet documentation for optimizer.SGD recommends rescale_grad = 1.0/batch_size, so perhaps 1.0/batch_size/args.ctx_num would be reasonable. Why did you use 1.0/args.ctx_num here instead?
Thanks!
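For later readers, a hedged sketch of the two pieces being discussed (the device count, batch size, and symbol names are illustrative, not the actual insightface training code): with normalization='valid', SoftmaxOutput already divides the gradient by the number of valid samples in the per-device batch, so the optimizer only needs the extra 1.0/ctx_num factor to average across devices, and the combined scale comes out to roughly 1/(per-device batch size * ctx_num) = 1/global batch size.

```python
import mxnet as mx

ctx_num = 4             # number of GPUs (illustrative)
per_device_batch = 128  # samples per device (illustrative)

data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')
fc = mx.sym.FullyConnected(data=data, num_hidden=10, name='fc')
# 'valid' normalization averages the loss gradient over the valid
# samples in each device's batch.
softmax = mx.sym.SoftmaxOutput(data=fc, label=label,
                               normalization='valid', name='softmax')

# The optimizer then only has to average across devices.
opt = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, wd=5e-4,
                       rescale_grad=1.0 / ctx_num)
```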