Resuming training repeats the last epoch
Describe the bug
When resuming training from a checkpoint, the last completed epoch is repeated. For example, if the checkpoint was saved after 4 epochs of training, the resumed run starts at epoch 4 again instead of epoch 5.
To Reproduce
Steps to reproduce the behavior:
- Train the model with the default CheckpointCallback.
- Try to resume training from any checkpoint with the default CheckpointCallback.
Colab notebook based on the MNIST tutorial: https://colab.research.google.com/drive/1hf1q0vrgc4Sh5tCnZR4WFsDkZJIzO9-e
import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner
logdir = "./logdir"
num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples)
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}
model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])
runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=4,  # Train for 4 epochs
)
# Let's train for 4 more epochs
# Issue: the training starts at epoch = 4 instead of epoch = 5,
# so the 4th epoch is repeated
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=8,
    verbose=True,
    resume='logdir/checkpoints/last_full.pth',
)
Expected behavior
Training resumes at the next epoch: if a checkpoint was saved after n epochs of training, the resumed run starts at epoch n+1.
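To make the expected behavior concrete, here is a minimal sketch in plain Python. It does not use Catalyst's internals: the `"epoch"` checkpoint key and both helper functions are hypothetical, and the sketch assumes epochs are numbered from 1 and that the checkpoint stores the number of completed epochs.

```python
def epochs_to_run_buggy(checkpoint: dict, num_epochs: int) -> list:
    # Hypothetical off-by-one: restarts at the epoch that already finished.
    return list(range(checkpoint["epoch"], num_epochs + 1))


def epochs_to_run_expected(checkpoint: dict, num_epochs: int) -> list:
    # Expected: a checkpoint saved after n epochs resumes at epoch n + 1.
    return list(range(checkpoint["epoch"] + 1, num_epochs + 1))


# Assumption: the checkpoint records 4 completed epochs (1-based numbering).
checkpoint = {"epoch": 4}

print(epochs_to_run_buggy(checkpoint, 8))     # [4, 5, 6, 7, 8]
print(epochs_to_run_expected(checkpoint, 8))  # [5, 6, 7, 8]
```

With `num_epochs=8`, the reported behavior corresponds to the first list (epoch 4 runs twice overall), while the second list is what the report asks for.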
Screenshots
Note the vertical step on the 4th epoch because it was repeated.
Is this issue still open? If it is not fixed, then I can use it as a first timer task.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.