Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might already be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Resuming training repeats the last epoch

See original GitHub issue

Describe the bug

When resuming training from a checkpoint, the last epoch is repeated. For example, if the checkpoint was taken after 4 epochs of training, the run resumes at epoch 4 again instead of epoch 5.
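The off-by-one is easy to picture with a short sketch. This is purely illustrative, assuming a checkpoint dict with an "epoch" key and a plain training loop; it is not Catalyst's actual resume code:

# Hypothetical illustration of the bug; the "epoch" key and the loop are assumptions.
checkpoint = {"epoch": 4}          # checkpoint written after epochs 1..4 completed

start_epoch = checkpoint["epoch"]  # buggy: the counter is restored as-is
for epoch in range(start_epoch, 8 + 1):  # aim: train through epoch 8
    print(f"running epoch {epoch}")      # prints 4 first, so epoch 4 runs twice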

To Reproduce

Steps to reproduce the behavior:

  1. Train the model with the default CheckpointCallback.
  2. Try to resume training from any checkpoint with the default CheckpointCallback.

Colab notebook based on the MNIST tutorial: https://colab.research.google.com/drive/1hf1q0vrgc4Sh5tCnZR4WFsDkZJIzO9-e

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner

logdir = "./logdir"

num_samples, num_features = int(1e4), int(1e1)
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples, 1)  # target shaped (N, 1) to match the model output
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=4  # Train for 4 epochs
)

# Let's train for 4 more epochs
# Issue: the training starts at epoch = 4 instead of epoch = 5,
# so the 4th epoch is repeated
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=8,
    verbose=True,
    resume="logdir/checkpoints/last_full.pth"
)
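Before resuming, it can help to confirm what the full checkpoint actually stores. The snippet below simply loads the file with torch.load and prints its keys; the "epoch" key is an assumption about the checkpoint layout, which may differ between Catalyst versions:

import torch

checkpoint = torch.load("logdir/checkpoints/last_full.pth", map_location="cpu")
print(sorted(checkpoint.keys()))  # inspect what the checkpoint contains
print(checkpoint.get("epoch"))    # assumed key: the last completed epoch, if present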

Expected behavior

Training resumes at the next epoch: if a checkpoint was taken after n epochs of training, the run continues at epoch n + 1.
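A sketch of the expected fix, again with assumed names: advance the restored counter by one before the loop resumes.

saved_epoch = 4                # checkpoint written after 4 completed epochs
start_epoch = saved_epoch + 1  # expected: training resumes at epoch 5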

Screenshots

Note the vertical step at the 4th epoch, which was repeated. [image]

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 13 (9 by maintainers)

Top GitHub Comments

1 reaction
kenluck2001 commented, Oct 12, 2020

Is this issue still open? If it is not fixed, then I can take it on as a first-timer task.

0 reactions
stale[bot] commented, Dec 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Read more comments on GitHub >

Top Results From Across the Web

How can I get the model to resume training from the epoch it ...
I stopped it at 10,000 but I want it to be able to resume training from the epoch it left off on. Is there...
Read more >
How can I continue training from the last epoch?
Hi @sam_mohel, to continue training from the last epoch you have to create model checkpoints that store the model weights, then you have ...
Read more >
[ESPnet2] Resume training from best (valid) epoch instead of ...
Hi I'm trying to manipulate the number of training epochs to achieve training the same network with multiple datasets one-by-one.
Read more >
How to train network with additional epoch after end of ...
To restart the training process from the last epoch, first you need to load the trained weights/checkpoints of the last trained epoch. You...
Read more >
Effective Model Saving and Resuming Training in PyTorch
This blog post explores how to do proper model saving in the PyTorch framework so that training can be resumed later on.
Read more >
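The results above all point at the same plain-PyTorch pattern: save the epoch counter next to the model and optimizer state, and start the next run one epoch later. A minimal, self-contained sketch (the file name and dict keys are illustrative, not tied to any of the libraries linked above):

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters())
epoch = 4  # last completed epoch

# Save: record the epoch counter alongside the state dicts.
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pth",
)

# Resume: restore the state and continue one epoch later.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # resume at epoch 5, avoiding the repeat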
