Resuming training repeats the last epoch

Describe the bug

When training is resumed from a checkpoint, the last completed epoch is repeated. For example, if the checkpoint was saved after 4 epochs of training, the resumed run starts at epoch 4 again instead of epoch 5.

To Reproduce

Steps to reproduce the behavior:

1. Train the model with the default CheckpointCallback.
2. Try to resume training from any checkpoint with the default CheckpointCallback.

Colab notebook based on the MNIST tutorial: https://colab.research.google.com/drive/1hf1q0vrgc4Sh5tCnZR4WFsDkZJIzO9-e

import torch
from torch.utils.data import DataLoader, TensorDataset
from catalyst.dl import SupervisedRunner

logdir = "./logdir"

num_samples, num_features = int(1e4), int(1e1)
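# Targets are shaped (num_samples, 1) to match the Linear(num_features, 1) output
# and avoid unintended broadcasting inside MSELoss.
X, y = torch.rand(num_samples, num_features), torch.rand(num_samples, 1)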
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, num_workers=1)
loaders = {"train": loader, "valid": loader}

model = torch.nn.Linear(num_features, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [3, 6])

runner = SupervisedRunner()
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=4  # Train for 4 epochs
)

# Let's train for 4 more epochs
# Issue: the training starts at epoch = 4 instead of epoch = 5,
# so the 4th epoch is repeated
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=8,
    verbose=True,
    resume=f"{logdir}/checkpoints/last_full.pth",
)

Expected behavior

Training resumes at the next epoch: if the checkpoint was saved after n epochs of training, the resumed run continues with epoch n+1.
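To make the off-by-one concrete, here is a minimal sketch in plain Python. It is not Catalyst's implementation: `run_epochs`, `make_checkpoint`, `resume_repeating`, and `resume_expected` are hypothetical names used only for illustration, and the sketch assumes that a checkpoint stores the number of the last finished epoch.

```python
# Hypothetical sketch: none of these names come from Catalyst.

def run_epochs(start_epoch, num_epochs):
    """Return the 1-based epoch numbers a training loop would execute."""
    return list(range(start_epoch, num_epochs + 1))


def make_checkpoint(last_finished_epoch):
    # Assumption: a checkpoint written at the end of an epoch records
    # that epoch's number.
    return {"epoch": last_finished_epoch}


def resume_repeating(checkpoint, num_epochs):
    # Reported behavior: the stored epoch is reused as the starting epoch,
    # so the last finished epoch runs a second time.
    return run_epochs(checkpoint["epoch"], num_epochs)


def resume_expected(checkpoint, num_epochs):
    # Expected behavior: continue with the epoch after the stored one.
    return run_epochs(checkpoint["epoch"] + 1, num_epochs)


first_run = run_epochs(1, 4)                 # epochs 1..4
checkpoint = make_checkpoint(first_run[-1])  # saved after epoch 4

print(resume_repeating(checkpoint, 8))  # [4, 5, 6, 7, 8]
print(resume_expected(checkpoint, 8))   # [5, 6, 7, 8]
```

With num_epochs=8, the repeating variant runs five more epochs (4 through 8) while the expected variant runs four (5 through 8), which matches the behavior described in this report.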

Screenshots

Note the vertical step at the 4th epoch, which appears because that epoch was repeated.

Top GitHub Comments

kenluck2001 commented, Oct 12, 2020

Is this issue still open? If it is not fixed, then I can use it as a first timer task.

stale[bot] commented, Dec 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.