
Runtime Error when resuming training

See original GitHub issue

I was training with multiple GPUs on my own dataset, but when resuming training, I got this error:

Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
  File "trainval_net.py", line 340, in <module>
    optimizer.step()
  File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3
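
For context (editorial note, not part of the original report): the failing line is PyTorch's SGD momentum update. The per-parameter momentum buffer buf is restored by optimizer.load_state_dict(), and it must have the same shape as the gradient d_p of the parameter it is attached to; if a saved buffer ends up paired with the wrong parameter, the in-place broadcast fails. A minimal illustration with made-up shapes that mirror the message above:

import torch

# Illustrative only: a restored momentum buffer paired with the wrong parameter.
# torch/optim/sgd.py performs `buf.mul_(momentum).add_(1 - dampening, d_p)`
# (0.4-era spelling); it fails when buf and d_p have incompatible shapes.
momentum, dampening = 0.9, 0.0
buf = torch.zeros(64, 3, 3, 3)   # e.g. a conv-weight-shaped buffer from the checkpoint
d_p = torch.zeros(25088)         # e.g. a gradient belonging to a different parameter

# RuntimeError: The expanded size of the tensor (3) must match the existing
# size (25088) at non-singleton dimension 3
buf.mul_(momentum).add_(d_p, alpha=1 - dampening)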

Environment: PyTorch 0.4.0, CUDA 9.0, cuDNN 7.1.2, Python 3.5, GPUs: 4 x Tesla V100

Command line I used:

CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124

I have tried everything I can to solve this problem, including many related issues like #515 #475 #506, but the problem still exists… Is there any possible solution? Thanks…
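
A shape mismatch like this usually means the optimizer state in the checkpoint is being mapped onto parameters of different shapes (or in a different order) than when it was saved, since optimizer.load_state_dict() matches saved per-parameter state to the optimizer's parameters by position. Whether that is the actual cause in this repo is not confirmed; the sketch below is a generic save/resume pattern that keeps the mapping stable, not the repo's trainval_net.py, and all names are illustrative:

import torch
import torch.nn as nn

# Generic save/resume sketch (not the repo's trainval_net.py; names are illustrative).
# optimizer.load_state_dict() maps saved per-parameter state (e.g. SGD momentum
# buffers) onto the optimizer's parameters by position, so the optimizer has to be
# built over the same parameters, in the same order, as when the checkpoint was saved.

def save_checkpoint(path, model, optimizer, epoch):
    # Unwrap nn.DataParallel so the saved keys are stable across single/multi-GPU runs.
    module = model.module if isinstance(model, nn.DataParallel) else model
    torch.save({
        'epoch': epoch,
        'model': module.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, path)

def resume_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path, map_location='cpu')
    module = model.module if isinstance(model, nn.DataParallel) else model
    module.load_state_dict(checkpoint['model'])
    # This call restores the momentum buffers that SGD later uses in step();
    # it only works if the optimizer's parameter list matches the one used at save time.
    optimizer.load_state_dict(checkpoint['optimizer'])
    return checkpoint['epoch']

The workaround discussed in the comments below simply skips loading the optimizer state, which avoids the mismatch at the cost of resetting the momentum buffers.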

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 14

Top GitHub Comments

4 reactions
HViktorTsoi commented, May 7, 2019

Yes… if I just use the out-of-the-box code, everything works just fine when training from scratch, either on a single GPU or multiple GPUs, but the error always occurs when resuming training…

But as I described before, if I comment out these two lines

# optimizer.load_state_dict(checkpoint['optimizer'])
# lr = optimizer.param_groups[0]['lr']

in trainval_net.py when resuming training, the training process goes on normally. I’ve tested the modified code on my own dataset: the loss converged normally and the mAP I got on the test set was also acceptable.

I’m using the SGD optimizer, so there doesn’t seem to be any adverse effect so far from not loading the optimizer’s state dict when resuming training. But it remains to be verified whether this has any negative effect on other optimizers like Adam.
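
For anyone who wants to keep most of the checkpoint’s optimizer state but avoid a crash when it does not fit, one possible middle ground is to check the saved momentum buffers against the current parameters before loading. This is only a sketch against the two lines quoted above (checkpoint and optimizer are the names from that snippet), not code from the repo:

# Sketch only (not from the repo): restore the optimizer state only if every saved
# momentum buffer matches the shape of the parameter it would be attached to.
current_params = [p for group in optimizer.param_groups for p in group['params']]
saved = checkpoint['optimizer']
saved_state = [saved['state'].get(pid, {})
               for group in saved['param_groups'] for pid in group['params']]

compatible = (
    len(current_params) == len(saved_state)
    and all(state.get('momentum_buffer') is None
            or state['momentum_buffer'].shape == param.shape
            for param, state in zip(current_params, saved_state))
)

if compatible:
    optimizer.load_state_dict(saved)
    lr = optimizer.param_groups[0]['lr']
else:
    print('Checkpoint optimizer state does not match the current parameters; skipping it.')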

I’m using torchvision 0.2.1 (build py35_1).

And you still get the same error?

Everything should work pretty much out-of-the-box; git pull and run… As a sanity check, have you tried to simply run everything normally (not resuming, single GPU, etc.)? And then resuming with default parameters? Steadily work your way up to the full version of what you want to run.

EDIT: What version of torchvision are you using?

1 reaction
syr-cn commented, Jun 27, 2022

Thanks! I’ve encountered the same issue, and this solution works for me.


