
reset_optimizer removes essential parameters in adaptive optimization algorithms

See original GitHub issue

šŸ› Describe the bug The reset_optimizer function that is by default called at the beginning of every training experience at this line by the make_optimizer by default reinitializes the optimizer with the modelā€™s parameters. This is done for all strategies that inherit base_strategy,applying it by default to all algorithms.

This does not cause problems if the optimizer is SGD. But when the optimizer is Adam, RMSProp, or another adaptive method that crucially tracks the running mean, variance, and other per-parameter statistics as part of its algorithm, calling this function deletes all of those statistics. Note that such adaptive optimizers are the most commonly used ones, including in the examples in Avalanche.
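For context, a minimal PyTorch sketch (plain PyTorch, not Avalanche code) of what gets lost when the optimizer is rebuilt from the model's parameters, which is effectively what reset_optimizer does:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step populates Adam's per-parameter running statistics.
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()
print(len(optimizer.state))   # > 0: exp_avg / exp_avg_sq now exist for each parameter

# Re-creating the optimizer from the model's parameters, as the reset does,
# throws all of that state away.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
print(len(optimizer.state))   # 0: the running statistics are gone
```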

🦋 Fix

I can work on fixing this.

My proposal:

1. Before the model_adaptation call, add a before_model_adaptation function that stores the current optimizer as an attribute on the strategy object.

2. Then, in reset_optimizer, determine the new parameters using the state_dict of the current model.

3. If there are no new parameters, the optimizer is not modified. This covers many popular methods, including the regularization-based and exemplar-based ones, that do not expand the model.

4. If the model has added parameters, and new keys are therefore detected in the state_dict, a new param_group is added to the optimizer (sketched below). This leaves the previous optimizer and its running statistics unchanged.

5. Finally, the stored previous optimizer is deleted using del.

This works for all current and future optimization algorithms.
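A minimal sketch of the param_group idea, assuming a plain PyTorch model and optimizer; the helper name add_new_params and the comparison by parameter identity (rather than state_dict keys) are illustrative choices, not Avalanche API:

```python
import torch
from torch import nn

def add_new_params(optimizer, previous_params, model):
    """Register parameters created by model adaptation as a fresh param_group,
    leaving the existing groups and their running statistics untouched."""
    known = {id(p) for p in previous_params}
    new_params = [p for p in model.parameters() if id(p) not in known]
    if new_params:  # only touch the optimizer when the model actually grew
        optimizer.add_param_group({"params": new_params})

# Usage: remember the parameters before adaptation ...
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
previous_params = list(model.parameters())

# ... the strategy expands the model (e.g. a new head for new classes) ...
model.extra_head = nn.Linear(4, 3)

# ... and instead of rebuilding the optimizer, only the new parameters are added.
add_new_params(optimizer, previous_params, model)
```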


Let me know if this works, and I'll create a PR to solve this.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
AntonioCarta commented, Aug 25, 2021

Are you sure that this is necessary? Typically, whenever you have a new experience, you also have a domain shift (either new classes or a new domain). Therefore, I don't expect the optimizer's statistics to be relevant anymore, and I think resetting them is correct. However, I never did an experimental comparison.

I think the current behavior is reasonable, and of course it can be changed as you explained above by the user themselves (if necessary). If you have a strong reason to change the default behavior (some experiment or a paper), I'm happy to change it as you propose.

Alternatively, we could add an example that shows how to retain the optimizer's statistics and leave the default as is.
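One hedged way such an example could look (plain PyTorch; the wiring into an Avalanche strategy or plugin is omitted and would be user-specific): save the optimizer's state_dict before the reset and restore it afterwards.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ... training on the current experience fills optimizer.state ...
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

saved = optimizer.state_dict()                              # snapshot before the reset
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # what the reset effectively does
optimizer.load_state_dict(saved)                            # running statistics are restored
```

Note that this only works as-is when the parameter set is unchanged; for dynamically expanding models the param_group approach above is still needed.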

0 reactions
ashok-arjun commented, Nov 21, 2021

Yes, so that would mean dynamic modules would still have the problem, for which we can look for a solution later.

In the case of static models, leaving the optimizers unchanged would suffice.

Read more comments on GitHub >
