Attempted to restart training on a COCO dataset after 2 epochs... failed with runtime error

I was training on COCO. Everything went smoothly: I got into the 3rd epoch and then paused the training. When I went to restart it, I got errors. I'm unsure how to proceed with debugging or cleaning up.

CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset coco --net res101 --bs 1 --nw 1 --lr .001 --lr_decay_step 10 --cuda --r true --checksession 1 --checkepoch 2 --checkpoint 234531 --use_tfb
234532 roidb entries
Loading pretrained weights from data/pretrained_model/resnet101_caffe.pth
loading checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
loaded checkpoint models/res101/coco/faster_rcnn_1_2_234531.pth
Traceback (most recent call last):
  File "trainval_net.py", line 339, in <module>
    optimizer.step()
  File "/python3.6/site-packages/torch/optim/sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
(pytorch1_py36) emcp@k:faster-rcnn.pytorch$ 

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

AlexanderHustinx commented, Apr 15, 2019 (5 reactions)

I ran into a similar problem. For me, the issue was solved by moving the lines:

  if args.cuda:
    fasterRCNN.cuda()

above the assignment of the optimizer, i.e. above:

  if args.optimizer == "adam":
    lr = lr * 0.1
    optimizer = torch.optim.Adam(params)

  elif args.optimizer == "sgd":
    optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

According to the PyTorch documentation, it is best practice to move the model to the GPU before constructing the optimizer.
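The ordering described above can be sketched as follows. This is only a minimal illustration, not the code from trainval_net.py: the helper name build_model_and_optimizer is hypothetical, a small nn.Linear stands in for fasterRCNN, and the learning rate and momentum values are placeholders. It falls back to the CPU when no GPU is available so that it runs anywhere.

```python
import torch
import torch.nn as nn


def build_model_and_optimizer(device, optimizer_state=None, lr=0.001, momentum=0.9):
    """Hypothetical helper: move the model first, then build the optimizer."""
    model = nn.Linear(4, 2)  # stand-in for fasterRCNN (assumption)
    model.to(device)         # 1. move the model to the target device

    # 2. construct the optimizer from the already-moved parameters
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)

    # 3. only now restore saved optimizer state; load_state_dict casts the
    #    state tensors (e.g. momentum buffers) to each parameter's device
    if optimizer_state is not None:
        optimizer.load_state_dict(optimizer_state)
    return model, optimizer


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# First run: take one step so that momentum buffers exist, then save the state.
model, optimizer = build_model_and_optimizer(device)
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()
saved_state = optimizer.state_dict()

# Resumed run: same ordering, so buffers and parameters share a device.
model2, optimizer2 = build_model_and_optimizer(device, optimizer_state=saved_state)
optimizer2.zero_grad()
model2(torch.randn(3, 4, device=device)).sum().backward()
optimizer2.step()
```

With this ordering, the momentum buffers restored from the checkpoint end up on the same device as the parameters, which is the condition the failing optimizer.step() call in the traceback requires.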
