Step should average the gradients by batch size.
It seems to me that the optimizer methods, taking SGD as an example, multiply the learning rate directly by the sum of gradients from a batch:
def _compute_step(self, grad):
    return - self.lr * grad
It is suggested that we average the gradients by batch size; the benefits of doing this are listed in this post. Basically, you do not have to adjust the learning rate when changing the batch size.
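A minimal sketch of what the averaged step could look like, assuming the step receives the batch size (the SGD class shown and the batch_size argument are illustrative, not the project's current signature):

class SGD:
    def __init__(self, lr):
        self.lr = lr

    def _compute_step(self, grad, batch_size):
        # Dividing the summed batch gradient by batch_size makes the
        # effective step independent of how many samples were summed.
        return - self.lr * grad / batch_size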
If you agree with this, I would create a pull request to add an option to use mean gradients while keeping compatibility with the plain sum of gradients (for efficiency considerations).
Issue Analytics
- Created 4 years ago
- Comments: 8 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I see your point. One possible way is to add a reduction parameter (like TF and PyTorch). But I would suggest just keeping everything simple for now.

Another disadvantage is that if the user accumulates gradients over batches of different sizes and then invokes apply_grad, my proposal would not handle this. I think I will close this issue for now: the proposed change is not as simple and elegant as the current implementation, and although it has some advantages, they are not important enough to outweigh keeping the simplicity of this project.
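For illustration, a rough sketch of the reduction idea mentioned above; the parameter name, the batch_size argument, and the SGD class shown are hypothetical, not the project's actual API:

class SGD:
    def __init__(self, lr, reduction="sum"):
        self.lr = lr
        self.reduction = reduction  # "sum" keeps the current behavior

    def _compute_step(self, grad, batch_size=1):
        # With reduction="mean" the summed gradient is divided by the
        # batch size, so the learning rate need not be retuned when the
        # batch size changes. This breaks down if gradients were
        # accumulated over batches of different sizes before apply_grad,
        # which is the drawback noted above.
        if self.reduction == "mean":
            grad = grad / batch_size
        return - self.lr * grad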