
Step should average the gradients by batch size.

See original GitHub issue

It seems to me that the optimizer methods (taking SGD as an example) multiply the learning rate directly by the sum of gradients over a batch:

def _compute_step(self, grad):
    # grad is the sum of per-sample gradients over the batch
    return -self.lr * grad

I suggest averaging the gradients by batch size; the benefits of doing this are listed in this post. Basically, you do not have to adjust the learning rate when changing the batch size.

If you agree, I would create a pull request that adds an option to use mean gradients while still allowing the plain sum of gradients (for efficiency considerations).
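For concreteness, here is a minimal sketch of what such an option could look like, assuming the optimizer can be told the batch size. The reduction parameter, the batch_size argument, and the class layout below are illustrative assumptions, not the project's actual API:

class SGD:
    def __init__(self, lr=0.01, reduction="sum"):
        self.lr = lr
        self.reduction = reduction  # "sum" keeps the current behavior

    def _compute_step(self, grad, batch_size=1):
        # grad is the sum of per-sample gradients over the batch
        if self.reduction == "mean":
            grad = grad / batch_size  # average by batch size
        return -self.lr * grad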

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
borgwang commented, Sep 13, 2019

I see your point. One possible way is to add a reduction parameter (like TF and PyTorch). But I would suggest just keeping everything simple for now.
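As a point of reference for the reduction idea, the snippet below (plain PyTorch, not this project) shows how a loss-level reduction choice affects the gradient scale: a mean-reduced loss yields gradients already divided by the batch size, so the optimizer step itself needs no change. The model and data here are made up purely for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4)   # batch of 8 samples, 4 features
y = torch.randn(8, 1)
model = nn.Linear(4, 1)

# With reduction="sum" the gradient grows with the batch size;
# with reduction="mean" it is already averaged, so the same
# learning rate works across batch sizes.
for reduction in ("sum", "mean"):
    model.zero_grad()
    loss = nn.MSELoss(reduction=reduction)(model(x), y)
    loss.backward()
    print(reduction, model.weight.grad.norm().item())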

0 reactions
w32zhong commented, Sep 13, 2019

Another disadvantage is that if a user accumulates gradients over batches of different sizes and then invokes apply_grad, my proposal would not handle that case.
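A toy illustration of that corner case (not from the original thread; the batch sizes and gradient values are made up):

import numpy as np

# Gradients summed over two batches of different sizes before one update.
grad_batch_a = np.ones(4) * 32   # summed gradient from a batch of 32 samples
grad_batch_b = np.ones(4) * 8    # summed gradient from a batch of 8 samples
grad_total = grad_batch_a + grad_batch_b

# If the optimizer divided by "the" batch size internally, no single value
# (32? 8? 40?) would be correct here; the caller still has to average:
grad_avg = grad_total / (32 + 8)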

I think I will close this issue for now. Although the proposed change has some advantages, it is not as simple and elegant as the current implementation, and those advantages are not important enough to outweigh keeping this project simple.

Read more comments on GitHub >

Top Results From Across the Web

How large should the batch size be for stochastic gradient ...
Batch gradient descent updates the weights θ using the gradients of the entire dataset x; whereas SGD updates the weights using an average...
Read more >
How to Control the Stability of Training Neural Networks With ...
Historically, a training algorithm where the batch size is set to the total number of training examples is called “batch gradient descent” and...
Read more >
Batch, Mini Batch & Stochastic Gradient Descent
In Batch Gradient Descent, all the training data is taken into consideration to take a single step. We take the average of the...
Read more >
Effect of batch size on training dynamics | by Kevin Shen
Finding: larger batch sizes make larger gradient steps than smaller batch sizes for the same number of samples seen. The x-axis shows batch...
Read more >
Gradient averaging with TensorFlow - Grzegorz Chlebus blog
When training big neural networks, it can happen that the biggest mini-batch size you can afford is one. In such cases training can...
Read more >
