Runtime Error when resuming training
I was training using multiple GPUs on my own dataset, but when resuming training, I got this error:
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
loading checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
loaded checkpoint models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth
Traceback (most recent call last):
File "trainval_net.py", line 340, in <module>
optimizer.step()
File "/home/sy1806701/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/optim/sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The expanded size of the tensor (3) must match the existing size (25088) at non-singleton dimension 3
Environment: PyTorch 0.4.0, CUDA 9.0, cuDNN 7.1.2, Python 3.5, GPUs: 4 x Tesla V100
Command line I used:
CUDA_VISIBLE_DEVICES=2,3,4,5 python trainval_net.py --dataset virtual_sign_2019 --net vgg16 --bs 32 --nw 16 --lr 0.001 --cuda --mGPUs --r True --checksession 1 --checkepoch 3 --checkpoint 1124
I have tried everything I can to solve this problem, including many related issues like #515 #475 #506, but the problem still exists… is there any possible solution? Thanks…
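For reference, here is a minimal sketch of how the shape mismatch behind this error can be confirmed after resuming. It assumes the checkpoint stores the weights and optimizer state under 'model' and 'optimizer' keys, and that fasterRCNN and optimizer are the model and SGD optimizer built in trainval_net.py — these names are assumptions, so adjust them to your copy of the script:

```python
import torch

# Hypothetical diagnostic: restore the checkpoint the way the resume branch
# does, then compare every restored momentum buffer against the parameter it
# is now attached to. Any mismatch found here is exactly what later fails in
# sgd.py at buf.mul_(momentum).add_(1 - dampening, d_p).
checkpoint = torch.load('models/vgg16/virtual_sign_2019/faster_rcnn_1_3_1124.pth')
fasterRCNN.load_state_dict(checkpoint['model'])        # key name assumed
optimizer.load_state_dict(checkpoint['optimizer'])     # key name assumed

for group in optimizer.param_groups:
    for p in group['params']:
        buf = optimizer.state.get(p, {}).get('momentum_buffer')
        if buf is not None and buf.shape != p.shape:
            print('mismatch:', tuple(buf.shape), 'vs', tuple(p.shape))
```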
Top GitHub Comments
Yes… if I just use the out-of-the-box code, everything works just fine when training from scratch, either on a single GPU or multiple GPUs, but the error always occurs when resuming training…
But as I described before, if I comment out these two lines in trainval_net.py when resuming training, the training process goes on normally. I've tested the modified code on my own dataset: the loss converged normally and the mAP I got on the test set was also acceptable. I'm using the SGD optimizer, so it seems there is no adverse effect so far if I don't load the optimizer's state dict when resuming training. But it remains to be verified whether this has any negative effect on other optimizers like Adam. I'm using torchvision 0.2.1 (build py35_1).
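For anyone else hitting this, the workaround above amounts to restoring the model weights but letting SGD rebuild its momentum buffers from scratch. A rough sketch of what the resume block looks like with that change, assuming the stock checkpoint keys ('model', 'optimizer', 'epoch') and a load_name built from --checksession/--checkepoch/--checkpoint (all names here are assumptions; match them to your trainval_net.py):

```python
# Resume sketch with the optimizer state load disabled.
checkpoint = torch.load(load_name)
args.start_epoch = checkpoint['epoch']          # key name assumed
fasterRCNN.load_state_dict(checkpoint['model']) # key name assumed

# Workaround: keep this commented out so SGD recreates its momentum
# buffers lazily on the first optimizer.step(), with the correct shapes.
# optimizer.load_state_dict(checkpoint['optimizer'])

# If you want to continue at the learning rate you stopped at, set it
# explicitly (e.g. the --lr passed on the command line) instead of
# relying on the saved optimizer state.
for param_group in optimizer.param_groups:
    param_group['lr'] = args.lr
```

Dropping the saved momentum only affects the first few updates, which matches the observation above that loss convergence and mAP were unaffected.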
thanks! I’ve encountered the same issue and this solution works for me.