Error in loading a stored checkpoint
Hello,
When I load a stored checkpoint, I get the following error:
RuntimeError: output with shape [128, 3, 1] doesn't match the broadcast shape [128, 3, 3]
If I am reading the state_dict correctly, I think there is probably a bug in your load_state_dict. For your convenience, I slightly modified your CompressAI/examples/train.py example so that it can also take a checkpoint as input and continue training from a previously stored checkpoint. To reproduce the error, just run the script below twice (I used the bmshj2018-hyperprior model):
- Once without --checkpoint-file, for 1-2 epochs, just to save a checkpoint.
- Then with --checkpoint-file [/address/of/stored/checkpoint], as in the example invocations below.
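Concretely, the two runs would look something like this (the dataset path is a placeholder, and checkpoint.pth.tar is simply the default filename written by save_checkpoint in the script):

python train.py -m bmshj2018-hyperprior -d /path/to/dataset -e 2 --save
python train.py -m bmshj2018-hyperprior -d /path/to/dataset --save --checkpoint-file checkpoint.pth.tar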
# Copyright 2020 InterDigital Communications, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import math
import random
import shutil
import sys

import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.data import DataLoader
from torchvision import transforms

from compressai.datasets import ImageFolder
from compressai.zoo import models


class RateDistortionLoss(nn.Module):
    """Custom rate distortion loss with a Lagrangian parameter."""

    def __init__(self, lmbda=1e-2):
        super().__init__()
        self.mse = nn.MSELoss()
        self.lmbda = lmbda

    def forward(self, output, target):
        N, _, H, W = target.size()
        out = {}
        num_pixels = N * H * W

        out["bpp_loss"] = sum(
            (torch.log(likelihoods).sum() / (-math.log(2) * num_pixels))
            for likelihoods in output["likelihoods"].values()
        )
        out["mse_loss"] = self.mse(output["x_hat"], target)
        out["loss"] = self.lmbda * 255 ** 2 * out["mse_loss"] + out["bpp_loss"]

        return out


class AverageMeter:
    """Compute running average."""

    def __init__(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


class CustomDataParallel(nn.DataParallel):
    """Custom DataParallel to access the module methods."""

    def __getattr__(self, key):
        try:
            return super().__getattr__(key)
        except AttributeError:
            return getattr(self.module, key)


def configure_optimizers(net, args):
    """Separate parameters for the main optimizer and the auxiliary optimizer.
    Return two optimizers"""

    parameters = set(
        p for n, p in net.named_parameters() if not n.endswith(".quantiles")
    )
    aux_parameters = set(
        p for n, p in net.named_parameters() if n.endswith(".quantiles")
    )

    # Make sure we don't have an intersection of parameters
    params_dict = dict(net.named_parameters())
    inter_params = parameters & aux_parameters
    union_params = parameters | aux_parameters

    assert len(inter_params) == 0
    assert len(union_params) - len(params_dict.keys()) == 0

    optimizer = optim.Adam(
        (p for p in parameters if p.requires_grad),
        lr=args.learning_rate,
    )
    aux_optimizer = optim.Adam(
        (p for p in aux_parameters if p.requires_grad),
        lr=args.aux_learning_rate,
    )
    return optimizer, aux_optimizer


def train_one_epoch(
    model, criterion, train_dataloader, optimizer, aux_optimizer, epoch, clip_max_norm
):
    model.train()
    device = next(model.parameters()).device

    for i, d in enumerate(train_dataloader):
        d = d.to(device)

        optimizer.zero_grad()
        aux_optimizer.zero_grad()

        out_net = model(d)

        out_criterion = criterion(out_net, d)
        out_criterion["loss"].backward()
        if clip_max_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_max_norm)
        optimizer.step()

        aux_loss = model.aux_loss()
        aux_loss.backward()
        aux_optimizer.step()

        if i % 10 == 0:
            print(
                f"Train epoch {epoch}: ["
                f"{i*len(d)}/{len(train_dataloader.dataset)}"
                f" ({100. * i / len(train_dataloader):.0f}%)]"
                f'\tLoss: {out_criterion["loss"].item():.3f} |'
                f'\tMSE loss: {out_criterion["mse_loss"].item():.3f} |'
                f'\tBpp loss: {out_criterion["bpp_loss"].item():.2f} |'
                f"\tAux loss: {aux_loss.item():.2f}"
            )


def test_epoch(epoch, test_dataloader, model, criterion):
    model.eval()
    device = next(model.parameters()).device

    loss = AverageMeter()
    bpp_loss = AverageMeter()
    mse_loss = AverageMeter()
    aux_loss = AverageMeter()

    with torch.no_grad():
        for d in test_dataloader:
            d = d.to(device)
            out_net = model(d)
            out_criterion = criterion(out_net, d)

            aux_loss.update(model.aux_loss())
            bpp_loss.update(out_criterion["bpp_loss"])
            loss.update(out_criterion["loss"])
            mse_loss.update(out_criterion["mse_loss"])

    print(
        f"Test epoch {epoch}: Average losses:"
        f"\tLoss: {loss.avg:.3f} |"
        f"\tMSE loss: {mse_loss.avg:.3f} |"
        f"\tBpp loss: {bpp_loss.avg:.2f} |"
        f"\tAux loss: {aux_loss.avg:.2f}\n"
    )

    return loss.avg


def save_checkpoint(state, is_best, filename="checkpoint.pth.tar"):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, "checkpoint_best_loss.pth.tar")


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Example training script.")
    parser.add_argument(
        "-m",
        "--model",
        default="bmshj2018-factorized",
        choices=models.keys(),
        help="Model architecture (default: %(default)s)",
    )
    parser.add_argument(
        "-d", "--dataset", type=str, required=True, help="Training dataset"
    )
    parser.add_argument(
        "-e",
        "--epochs",
        default=100,
        type=int,
        help="Number of epochs (default: %(default)s)",
    )
    parser.add_argument(
        "-lr",
        "--learning-rate",
        default=1e-4,
        type=float,
        help="Learning rate (default: %(default)s)",
    )
    parser.add_argument(
        "-n",
        "--num-workers",
        type=int,
        default=30,
        help="Dataloaders threads (default: %(default)s)",
    )
    parser.add_argument(
        "--lambda",
        dest="lmbda",
        type=float,
        default=1e-2,
        help="Bit-rate distortion parameter (default: %(default)s)",
    )
    parser.add_argument(
        "--batch-size", type=int, default=16, help="Batch size (default: %(default)s)"
    )
    parser.add_argument(
        "--test-batch-size",
        type=int,
        default=64,
        help="Test batch size (default: %(default)s)",
    )
    parser.add_argument(
        "--aux-learning-rate",
        default=1e-3,
        help="Auxiliary loss learning rate (default: %(default)s)",
    )
    parser.add_argument(
        "--patch-size",
        type=int,
        nargs=2,
        default=(256, 256),
        help="Size of the patches to be cropped (default: %(default)s)",
    )
    parser.add_argument("--cuda", action="store_true", help="Use cuda")
    parser.add_argument("--save", action="store_true", help="Save model to disk")
    parser.add_argument(
        "--seed", type=float, help="Set random seed for reproducibility"
    )
    parser.add_argument(
        "--clip_max_norm",
        default=1.0,
        type=float,
        help="gradient clipping max norm (default: %(default)s",
    )
    parser.add_argument(
        "--checkpoint-file",
        type=str,
        help="File address to resume training from the previous saved checkpoint",
    )
    args = parser.parse_args(argv)
    return args


def main(argv):
    args = parse_args(argv)

    if args.seed is not None:
        torch.manual_seed(args.seed)
        random.seed(args.seed)

    train_transforms = transforms.Compose(
        [transforms.RandomCrop(args.patch_size), transforms.ToTensor()]
    )
    test_transforms = transforms.Compose(
        [transforms.CenterCrop(args.patch_size), transforms.ToTensor()]
    )

    train_dataset = ImageFolder(args.dataset, split="train", transform=train_transforms)
    test_dataset = ImageFolder(args.dataset, split="test", transform=test_transforms)

    train_dataloader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        shuffle=True,
        pin_memory=True,
    )
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        num_workers=args.num_workers,
        shuffle=False,
        pin_memory=True,
    )

    device = "cuda" if args.cuda and torch.cuda.is_available() else "cpu"

    net = models[args.model](quality=3)
    net = net.to(device)

    if args.cuda and torch.cuda.device_count() > 1:
        net = CustomDataParallel(net)

    optimizer, aux_optimizer = configure_optimizers(net, args)
    lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min")
    criterion = RateDistortionLoss(lmbda=args.lmbda)

    last_epoch = -1
    if args.checkpoint_file:  # load from previous checkpoint
        print("Loading", args.checkpoint_file)
        checkpoint = torch.load(args.checkpoint_file, map_location=device)
        last_epoch = checkpoint["epoch"]
        net.load_state_dict((checkpoint["state_dict"]))
        net.update(force=True)  # update the model CDFs parameters.
        optimizer.load_state_dict((checkpoint["optimizer"]))
        aux_optimizer.load_state_dict((checkpoint["aux_optimizer"]))
        lr_scheduler.load_state_dict((checkpoint["lr_scheduler"]))

    best_loss = 1e10
    for epoch in range(last_epoch + 1, args.epochs):
        print(f"Learning rate: {optimizer.param_groups[0]['lr']}")
        train_one_epoch(
            net,
            criterion,
            train_dataloader,
            optimizer,
            aux_optimizer,
            epoch,
            args.clip_max_norm,
        )

        loss = test_epoch(epoch, test_dataloader, net, criterion)
        lr_scheduler.step(loss)

        is_best = loss < best_loss
        best_loss = min(loss, best_loss)
        if args.save:
            save_checkpoint(
                {
                    "epoch": epoch,
                    "state_dict": net.state_dict(),
                    "loss": loss,
                    "optimizer": optimizer.state_dict(),
                    "aux_optimizer": aux_optimizer.state_dict(),
                    "lr_scheduler": lr_scheduler.state_dict(),
                },
                is_best,
            )


if __name__ == "__main__":
    main(sys.argv[1:])
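In case it helps to narrow things down, the saved file can also be inspected directly before resuming. A small sketch, using only the keys written by save_checkpoint above (checkpoint.pth.tar is the script's default filename; adjust if you saved elsewhere):

import torch

# Load the checkpoint written by the first run.
checkpoint = torch.load("checkpoint.pth.tar", map_location="cpu")

# Keys saved by the training script: epoch, state_dict, loss, optimizer,
# aux_optimizer, lr_scheduler.
print(sorted(checkpoint.keys()))
print("epoch:", checkpoint["epoch"])

# Print every entry name and shape in the saved model weights, e.g. to spot
# the [128, 3, ...]-shaped tensors mentioned in the error message.
for name, tensor in checkpoint["state_dict"].items():
    print(name, tuple(tensor.shape))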
So, this line

optimizer.load_state_dict((checkpoint["optimizer"]))

results in the RuntimeError. It has to do with the use of set in the optimizer definitions. Since sets are not sorted, the indexing of the parameter info in the optimizer state dict differs between initializations. Took me a while to figure out the issue since internally I coded it correctly 😉. I'll push a fix soon. Thanks again for reporting!
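For anyone hitting the same error before the fix lands: since the problem is the unordered iteration over the parameter sets, one way to make the split deterministic is to collect parameter names and sort them before building the optimizers. A minimal sketch of a drop-in replacement for configure_optimizers above (an illustration only, not necessarily the exact fix that was merged):

import torch.optim as optim  # same import as in the training script


def configure_optimizers(net, args):
    """Separate the main and auxiliary parameters with a deterministic ordering."""
    parameters = {
        n for n, p in net.named_parameters() if not n.endswith(".quantiles")
    }
    aux_parameters = {
        n for n, p in net.named_parameters() if n.endswith(".quantiles")
    }

    # Sanity checks: the two groups must not overlap and must cover all parameters.
    params_dict = dict(net.named_parameters())
    assert len(parameters & aux_parameters) == 0
    assert len(parameters | aux_parameters) == len(params_dict)

    # Sorting the names fixes the iteration order, so the indices stored in the
    # optimizer state dict refer to the same tensors across runs.
    optimizer = optim.Adam(
        (params_dict[n] for n in sorted(parameters) if params_dict[n].requires_grad),
        lr=args.learning_rate,
    )
    aux_optimizer = optim.Adam(
        (params_dict[n] for n in sorted(aux_parameters) if params_dict[n].requires_grad),
        lr=args.aux_learning_rate,
    )
    return optimizer, aux_optimizer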
Hi Navid, I'll keep our current implementation as it's easier to maintain w.r.t. our internal codebase. Also, our intention was just to provide an example training script for people to get started and take it from there.