RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing
Is anyone able to run the test phase for the "DRN-D-105" architecture on test data? I am able to train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1. I checked resources while testing and they are free enough (both GPU memory and system RAM). I am using an NVIDIA P100 GPU with 16 GB of memory.
Any thoughts?
(bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
classes : 26
batch_size : 1
pretrained :
momentum : 0.9
with_gt : False
phase : test
list_dir : None
lr_mode : step
weight_decay : 0.0001
epochs : 10
step : 200
bn_sync : False
ms : False
arch : drn_d_105
random_rotate : 0
random_scale : 0
workers : 2
crop_size : 896
lr : 0.01
load_rel : None
resume : model_best.pth.tar
evaluate : False
cmd : test
data_dir : dataset/
test_suffix :
[2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
[2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
segment.py:540: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  image_var = Variable(image, requires_grad=False, volatile=True)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
Traceback (most recent call last):
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
    return ForkingPickler.loads(res)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "segment.py", line 789, in <module>
    main()
  File "segment.py", line 785, in main
    test_seg(args)
  File "segment.py", line 720, in test_seg
    has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
  File "segment.py", line 544, in test
    final = model(image_var)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "segment.py", line 142, in forward
    y = self.up(x)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: out of memory
It's because the new PyTorch deprecated `volatile`, which was used to disable gradient recording. The new recommended way is to use `torch.no_grad()`. In the last line of `segment.py`, wrap `main()` with `with torch.no_grad():`.
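A minimal sketch of that change, assuming `segment.py` ends with a bare `main()` call (the traceback above only shows that line 789 calls `main()`):

```python
# Bottom of segment.py -- a sketch of the suggested fix, not the exact file.
# Wrapping main() in torch.no_grad() disables gradient tracking for the whole
# run; it replaces the removed volatile=True flag, so activations are no
# longer kept for backprop and test-time GPU memory usage drops.
import torch

if __name__ == '__main__':
    with torch.no_grad():
        main()
```

Equivalently, only the inference code could be wrapped, e.g. replacing the deprecated `Variable(image, requires_grad=False, volatile=True)` at `segment.py:540` (the line the UserWarning points at) with a `with torch.no_grad():` block around the forward pass.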
That's because the crop_size argument is disabled while testing; it is only used during training. Please refer to https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L632-L640 and https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L360-L383
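For illustration only, a schematic of the difference those linked lines describe; the torchvision transform names below are stand-ins, not the repository's actual data pipeline:

```python
# Schematic only -- NOT the code in fyu/drn; the real transforms live in the
# linked segment.py lines. The point: training crops images to crop_size, so
# memory per sample is bounded, while testing feeds the full-resolution image,
# so --crop-size (e.g. 256) has no effect on test-time GPU memory.
from torchvision import transforms

crop_size = 896  # value printed in the Namespace above

train_transform = transforms.Compose([
    transforms.RandomCrop(crop_size),  # applied only during training
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),             # no crop: the whole image goes to the GPU
])
```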