Training on a pre-trained model: RuntimeError: CUDA error: out of memory
🐛 Bug
I am launching training of a pretrained model on a two-class COCO-like dataset.
To Reproduce
Steps to reproduce the behavior:
- Run training with this command line:

```shell
python tools/train_net.py --config-file "configs/myconfig.yaml" SOLVER.IMS_PER_BATCH 10 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
```
Where myconfig.yaml points to mymodel.pth like this:

```yaml
WEIGHT: "/Users/karimimohammedbelhal/.torch/models/mymodel"
```
And mymodel.pth is a pretrained model with the class-dependent keys deleted, as suggested in #15.
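For reference, that key deletion can be done with a short script. This is a sketch, not the exact procedure from #15: the substrings filtered below (`cls_score`, `bbox_pred`, `mask_fcn_logits`) are assumptions based on maskrcnn-benchmark's predictor layer names, and the file paths are illustrative — check both against your own checkpoint.

```python
# Sketch: remove the class-dependent head weights from a COCO-pretrained
# checkpoint so it can be fine-tuned on a dataset with a different number
# of classes. The substrings below are assumptions based on
# maskrcnn-benchmark's predictor naming; verify against the actual keys
# in your checkpoint before relying on them.
CLASS_DEPENDENT = ("cls_score", "bbox_pred", "mask_fcn_logits")

def strip_class_heads(state_dict, patterns=CLASS_DEPENDENT):
    """Return a copy of state_dict without parameters whose shapes
    depend on the number of classes."""
    return {k: v for k, v in state_dict.items()
            if not any(p in k for p in patterns)}

# Typical usage (requires torch; paths are illustrative):
#   import torch
#   ckpt = torch.load("coco_pretrained.pth", map_location="cpu")
#   ckpt["model"] = strip_class_heads(ckpt["model"])
#   torch.save(ckpt, "mymodel.pth")
```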
Expected behavior
Training should start and complete.
Environment
```
PyTorch version: 1.0.0.dev20181123
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 396.51
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip3] numpy (1.13.3)
[pip3] torch (0.4.1)
[pip3] torchvision (0.2.1)
[conda] pytorch-nightly 1.0.0.dev20181123 py3.7_cuda9.0.176_cudnn7.4.1_0 pytorch
```
Returned Error
```
Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 31, in train
    model.to(device)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/belhal/anaconda3/envs/maskrcnn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
```
Issue Analytics
- State:
- Created 5 years ago
- Comments: 12 (4 by maintainers)
Top GitHub Comments
If you use a single GPU to train a model, you should make `IMS_PER_BATCH` small enough (e.g. `IMS_PER_BATCH=2`).

I solved this problem by rebooting the server.
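The `IMS_PER_BATCH=2` advice follows the linear scaling rule described in maskrcnn-benchmark's README: when the batch size shrinks to fit GPU memory, the learning rate shrinks and the iteration schedule stretches by the same factor. A minimal sketch, assuming the repo's default 1x schedule (16 images, LR 0.02, 90k iterations, steps at 60k/80k) as the reference point:

```python
# Sketch of the linear scaling rule: reducing IMS_PER_BATCH by a factor
# reduces BASE_LR and stretches the schedule by that same factor.
# The reference defaults below are assumptions taken from
# maskrcnn-benchmark's 1x schedule.
def scale_schedule(ims_per_batch, ref_batch=16, ref_lr=0.02,
                   ref_iters=90000, ref_steps=(60000, 80000)):
    factor = ref_batch / ims_per_batch
    return {
        "IMS_PER_BATCH": ims_per_batch,
        "BASE_LR": ref_lr / factor,
        "MAX_ITER": int(ref_iters * factor),
        "STEPS": tuple(int(s * factor) for s in ref_steps),
    }

# scale_schedule(2) -> BASE_LR 0.0025, MAX_ITER 720000,
# STEPS (480000, 640000) -- i.e. the solver values in the original
# command were tuned for a batch of 2, not the batch of 10 that was
# actually passed.
```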