Older implementation of Learning without Forgetting performed better
🐛 Describe the bug
The old implementation of LwF performed much better than the current one. Is this working as intended, or has a bug been introduced? As a further note, what sort of testing would prevent this kind of regression?
🐜 To Reproduce
# Minimal reproduction: LwF on SplitMNIST with accuracy metrics logged interactively.
from torch.optim import SGD
from torch.nn import CrossEntropyLoss

from avalanche.benchmarks.classic import SplitMNIST
from avalanche.evaluation.metrics import accuracy_metrics
from avalanche.models import SimpleMLP
from avalanche.logging import InteractiveLogger
from avalanche.training.plugins import EvaluationPlugin
from avalanche.training.strategies import LwF

scenario = SplitMNIST(n_experiences=5)
model = SimpleMLP(num_classes=scenario.n_classes)

eval_plugin = EvaluationPlugin(
    accuracy_metrics(minibatch=True, epoch=True, experience=True, stream=True),
    loggers=[InteractiveLogger()],
)

cl_strategy = LwF(
    model,
    SGD(model.parameters(), lr=0.001, momentum=0.9),
    CrossEntropyLoss(),
    train_mb_size=500,
    train_epochs=1,
    eval_mb_size=100,
    alpha=10,
    temperature=2.0,
    evaluator=eval_plugin,
)

print('Starting experiment...')
for i, experience in enumerate(scenario.train_stream):
    print("Start of experience: ", experience.current_experience)
    print("Current Classes: ", experience.classes_in_this_experience)
    res = cl_strategy.train(experience)
    print('Training completed')

    # Evaluate on the test experiences seen so far.
    print('Computing accuracy on the whole test set')
    cl_strategy.eval(scenario.test_stream[:i + 1])
New implementation
100%|██████████| 19/19 [00:00<00:00, 52.78it/s]
> Eval on experience 4 (Task 0) from test stream ended.
Top1_Acc_Exp/eval_phase/test_stream/Task000/Exp004 = 0.0021
-- >> End of eval phase << --
Top1_Acc_Stream/eval_phase/test_stream/Task000 = 0.2298
Old implementation as of 5356591e2355fbf2aa3d5c0dd5f7bc8613991cff
100%|██████████| 21/21 [00:00<00:00, 47.12it/s]
> Eval on experience 4 (Task 0) from test stream ended.
Top1_Acc_Exp/eval_phase/test_stream/Task000/Exp004 = 0.9178
-- >> End of eval phase << --
Top1_Acc_Stream/eval_phase/test_stream = 0.4413
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I see. Thanks for the clarification, I greatly appreciate it. I’m glad it’s working as intended and that we have the reproducible continual learning repository as a way to detect this sort of issue.
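(For context, a regression check along those lines might look roughly like the sketch below. It reruns the reproduction script end-to-end and asserts a known-good floor on final stream accuracy; the threshold, and the assumption that eval() returns the last metric dictionary keyed as in the logs above, are illustrative.)

from torch.nn import CrossEntropyLoss
from torch.optim import SGD

from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.strategies import LwF

def test_lwf_stream_accuracy_regression():
    # Train LwF on SplitMNIST end-to-end and fail if final stream accuracy
    # drops below a known-good floor (threshold value is illustrative).
    scenario = SplitMNIST(n_experiences=5)
    model = SimpleMLP(num_classes=scenario.n_classes)
    strategy = LwF(model, SGD(model.parameters(), lr=0.001, momentum=0.9),
                   CrossEntropyLoss(), train_mb_size=500, train_epochs=1,
                   eval_mb_size=100, alpha=10, temperature=2.0)
    for experience in scenario.train_stream:
        strategy.train(experience)
    results = strategy.eval(scenario.test_stream)
    assert results['Top1_Acc_Stream/eval_phase/test_stream/Task000'] > 0.40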
@AntonioCarta The point of the softmax is that it illuminates the slight differences in the output layer. Essentially the teacher network is saying “that potato looks like a potato, but it also looks just a bit like a dog”. The issue seems to be that this “dark knowledge” is lost when only a few activation units are used. This problem is very significant in some of my research, because I am using a pre-trained ResNet.
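(Concretely, the temperature-scaled distillation term under discussion looks roughly like the sketch below; the helper name is illustrative, not Avalanche’s internal API.)

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Dividing the logits by a temperature > 1 softens the softmax, exposing
    # the small probabilities ("the potato also looks a bit like a dog")
    # that carry the dark knowledge described above.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # KL divergence between the softened distributions; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * temperature ** 2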
The issue that training (the traditional part) drives all the probabilities down is notable. It’s true that, before training on an experience, that experience’s activations are meaningless, at least for a randomly initialised model. However, I think it’s clear empirically that the additional knowledge (or sort of regularisation) is significant, especially if a pretrained model is used. I concede that my original idea of adding useless outputs is a little silly, but a revised version would be to perform distillation with a different non-final layer close to the end. I’m not suggesting we do that for LwF, but it might be advantageous.
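(A minimal sketch of that revised idea, distilling from a penultimate feature layer instead of the output head; the function, the MSE choice, and the hooked layer are assumptions for illustration, not a proposal for Avalanche’s LwF.)

import torchvision
import torch.nn.functional as F

def feature_distillation_loss(student_features, teacher_features):
    # Match an intermediate representation rather than the output head, so the
    # regularisation does not depend on how many output units the old task
    # happened to use. MSE is one common choice; purely illustrative.
    return F.mse_loss(student_features, teacher_features)

# Grabbing penultimate activations from a pre-trained ResNet with a forward
# hook (the layer choice is illustrative):
resnet = torchvision.models.resnet18(pretrained=True)
features = {}

def save_penultimate(module, inputs, output):
    features['penultimate'] = output.flatten(1)

resnet.avgpool.register_forward_hook(save_penultimate)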