RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing
Is anyone able to run the test phase for the "DRN-D-105" architecture on test data? I am able to train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1. I checked resources while testing and they are free enough (both GPU memory and system RAM). I am using an NVIDIA P100 GPU with 16 GB of memory.
Any thoughts?
(bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
classes : 26
batch_size : 1
pretrained :
momentum : 0.9
with_gt : False
phase : test
list_dir : None
lr_mode : step
weight_decay : 0.0001
epochs : 10
step : 200
bn_sync : False
ms : False
arch : drn_d_105
random_rotate : 0
random_scale : 0
workers : 2
crop_size : 896
lr : 0.01
load_rel : None
resume : model_best.pth.tar
evaluate : False
cmd : test
data_dir : dataset/
test_suffix :
[2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
[2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
segment.py:540: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  image_var = Variable(image, requires_grad=False, volatile=True)
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
Traceback (most recent call last):
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
    return ForkingPickler.loads(res)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "segment.py", line 789, in <module>
    main()
  File "segment.py", line 785, in main
    test_seg(args)
  File "segment.py", line 720, in test_seg
    has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
  File "segment.py", line 544, in test
    final = model(image_var)[0]
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "segment.py", line 142, in forward
    y = self.up(x)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: out of memory
It's because the new PyTorch deprecated `volatile`, which was used to disable gradient recording. The new recommended way is to use `torch.no_grad()`. In the last line of `segment.py`, wrap `main()` with `with torch.no_grad():`.
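A minimal sketch of that change, assuming `segment.py` ends with a bare `main()` call (the traceback above only shows that line 789 calls `main()`):

```python
# Bottom of segment.py -- a sketch of the suggested fix, not the exact file.
# Wrapping main() in torch.no_grad() disables gradient tracking for the whole
# run; it replaces the removed volatile=True flag, so activations are no
# longer kept for backprop and test-time GPU memory usage drops.
import torch

if __name__ == '__main__':
    with torch.no_grad():
        main()
```

Equivalently, only the inference code could be wrapped, e.g. replacing the deprecated `Variable(image, requires_grad=False, volatile=True)` at `segment.py:540` (the line the UserWarning points at) with a `with torch.no_grad():` block around the forward pass.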
That's because the crop_size argument is disabled while testing; it is only used during training. Please refer to https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L632-L640 and https://github.com/fyu/drn/blob/d75db2ee7070426db7a9264ee61cf489f8cf178c/segment.py#L360-L383
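For illustration only, a schematic of the difference those linked lines describe; the torchvision transform names below are stand-ins, not the repository's actual data pipeline:

```python
# Schematic only -- NOT the code in fyu/drn; the real transforms live in the
# linked segment.py lines. The point: training crops images to crop_size, so
# memory per sample is bounded, while testing feeds the full-resolution image,
# so --crop-size (e.g. 256) has no effect on test-time GPU memory.
from torchvision import transforms

crop_size = 896  # value printed in the Namespace above

train_transform = transforms.Compose([
    transforms.RandomCrop(crop_size),  # applied only during training
    transforms.ToTensor(),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),             # no crop: the whole image goes to the GPU
])
```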