Step should average the gradients by batch size.
It seems to me that the optimizer methods, taking SGD as an example, multiply the learning rate directly by the sum of gradients from a batch:
def _compute_step(self, grad):
    return - self.lr * grad
It is suggested that we average the gradients by batch size; the benefits of doing this are listed in this post. Basically, you do not have to adjust the learning rate when changing the batch size.
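A minimal sketch of what the averaged step could look like, assuming the step receives the batch size (the SGD class shown and the batch_size argument are illustrative, not the project's current signature):

class SGD:
    def __init__(self, lr):
        self.lr = lr

    def _compute_step(self, grad, batch_size):
        # Dividing the summed batch gradient by batch_size makes the
        # effective step independent of how many samples were summed.
        return - self.lr * grad / batch_size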
If you agree with this, I would create a pull request to add an option to use mean gradients while keeping compatibility with the plain sum of gradients (for efficiency considerations).
Issue Analytics
- Created 4 years ago
- Comments: 8 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I see your point. One possible way is to add a reduction parameter (like TF and PyTorch). But I would suggest just keeping everything simple for now.

Another disadvantage is that if the user accumulates gradients over batches of different sizes and then invokes apply_grad, my proposal would not handle this. I think I will close this issue for now: the proposed change is not as simple and elegant as the current implementation, and although it has some advantages, they are not important enough to outweigh keeping the simplicity of this project.
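For illustration, a rough sketch of the reduction idea mentioned above; the parameter name, the batch_size argument, and the SGD class shown are hypothetical, not the project's actual API:

class SGD:
    def __init__(self, lr, reduction="sum"):
        self.lr = lr
        self.reduction = reduction  # "sum" keeps the current behavior

    def _compute_step(self, grad, batch_size=1):
        # With reduction="mean" the summed gradient is divided by the
        # batch size, so the learning rate need not be retuned when the
        # batch size changes. This breaks down if gradients were
        # accumulated over batches of different sizes before apply_grad,
        # which is the drawback noted above.
        if self.reduction == "mean":
            grad = grad / batch_size
        return - self.lr * grad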