
RuntimeError: CUDA error: out of memory for "DRN-D-105" while testing


Is anyone able to run the test phase for the "DRN-D-105" architecture on test data? I can train and validate, but testing fails with "RuntimeError: CUDA error: out of memory", even with a small crop size of 256×256 and a batch size of 1. I checked resources during the run, and both GPU memory and system RAM are largely free. I am using an NVIDIA P100 GPU with 16 GB of memory.

Any thoughts?

    (bhakti) user@user:/mnt/komal/bhakti/anue$ python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 --resume model_best.pth.tar --phase test --batch-size 1 -j2
    Namespace(arch='drn_d_105', batch_size=1, bn_sync=False, classes=26, cmd='test', crop_size=896, data_dir='dataset/', epochs=10, evaluate=False, list_dir=None, load_rel=None, lr=0.01, lr_mode='step', momentum=0.9, ms=False, phase='test', pretrained='', random_rotate=0, random_scale=0, resume='model_best.pth.tar', step=200, test_suffix='', weight_decay=0.0001, with_gt=False, workers=2)
    [2019-09-14 19:14:23,173 segment.py:697 test_seg] => loading checkpoint 'model_best.pth.tar'
    [2019-09-14 19:14:23,509 segment.py:703 test_seg] => loaded checkpoint 'model_best.pth.tar' (epoch 1)
    segment.py:540: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
      image_var = Variable(image, requires_grad=False, volatile=True)
    Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f15eff61160>>
    Traceback (most recent call last):
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
        self._shutdown_workers()
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
        self.worker_result_queue.get()
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/queues.py", line 337, in get
        return ForkingPickler.loads(res)
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
        fd = df.detach()
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/resource_sharer.py", line 58, in detach
        return reduction.recv_handle(conn)
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 181, in recv_handle
        return recvfds(s, 1)[0]
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/multiprocessing/reduction.py", line 152, in recvfds
        msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
    ConnectionResetError: [Errno 104] Connection reset by peer
    Traceback (most recent call last):
      File "segment.py", line 789, in <module>
        main()
      File "segment.py", line 785, in main
        test_seg(args)
      File "segment.py", line 720, in test_seg
        has_gt=phase != 'test' or args.with_gt, output_dir=out_dir)
      File "segment.py", line 544, in test
        final = model(image_var)[0]
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
        return self.module(*inputs[0], **kwargs[0])
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "segment.py", line 142, in forward
        y = self.up(x)
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/user/anaconda2/envs/bhakti/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 691, in forward
        output_padding, self.groups, self.dilation)
    RuntimeError: CUDA error: out of memory

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
taesungp commented, Jul 25, 2020

This happens because newer PyTorch versions removed volatile, which was previously used to disable gradient recording during inference. The recommended replacement is torch.no_grad().

At the bottom of segment.py, wrap the call to main() in with torch.no_grad():

if __name__ == "__main__":
    with torch.no_grad():
        main()
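The same pattern can also be applied directly around the forward pass instead of the whole program. A minimal sketch of how torch.no_grad() replaces the old volatile flag (the stand-in model and tensor shapes here are illustrative, not the ones segment.py uses):

```python
import torch
import torch.nn as nn

# Stand-in model; in segment.py this would be the DRN segmentation network.
model = nn.Conv2d(3, 26, kernel_size=1)
model.eval()

image = torch.randn(1, 3, 64, 64)  # a single small input image

# Old (removed) API: image_var = Variable(image, volatile=True)
# New API: disable autograd bookkeeping for the whole forward pass, so
# PyTorch does not cache activations for a backward pass it will never run.
with torch.no_grad():
    final = model(image)

print(final.requires_grad)  # gradients were never tracked
```

Besides silencing the warning, this also lowers peak GPU memory during testing, since no intermediate activations are kept for backpropagation.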
0 reactions
raven38 commented, Jul 14, 2020

Top Results From Across the Web

"RuntimeError: CUDA error: out of memory" - Stack Overflow
The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size...
Solving the “RuntimeError: CUDA Out of memory” error
When using multi-gpu systems I'd recommend using the `CUDA_VISIBLE_DEVICES` environment variable to select the GPU to use. $ export ...
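On a multi-GPU machine, that tip looks like the following (GPU index 0 is an example; pick whichever device is actually free):

```shell
# Make only GPU 0 visible to the process; CUDA renumbers it as device 0.
# The variable must be exported before the Python process initializes CUDA.
export CUDA_VISIBLE_DEVICES=0

# Then launch the test phase as before, e.g.:
#   python3 segment.py test -d dataset/ -c 26 --arch drn_d_105 \
#       --resume model_best.pth.tar --phase test --batch-size 1 -j2
echo "visible GPUs: $CUDA_VISIBLE_DEVICES"
```

This guarantees the job does not land on a GPU whose memory is already claimed by another process.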
Resolving CUDA Being Out of Memory With Gradient ...
Implementing gradient accumulation and automatic mixed precision to solve CUDA out of memory issue when training big deep learning models ...
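The gradient-accumulation idea applies to the training phase: process several micro-batches that fit in memory, and step the optimizer once per accumulated window. A sketch under assumed placeholders (the tiny model, random data, and accum_steps below are hypothetical, not code from this repository):

```python
import torch
import torch.nn as nn

# Tiny hypothetical setup: accumulate gradients over several micro-batches
# to emulate one large batch that would not fit in GPU memory.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(2, 8)            # micro-batch of 2 samples
    target = torch.randn(2, 2)
    loss = loss_fn(model(x), target) / accum_steps  # average over the window
    loss.backward()                  # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # one update per accumulated window
        optimizer.zero_grad()
```

Dividing the loss by accum_steps keeps the accumulated gradient equal in scale to one large-batch gradient, so the learning rate does not need retuning.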
Solving "CUDA out of memory" Error - Kaggle
If you try to train multiple models on GPU, you are most likely to encounter some error similar to this one: RuntimeError: CUDA...
Getting "RuntimeError: CUDA error: out of memory" when ...
I'm trying to run a test code on GPU of a remote machine. The code is import torch foo = torch.tensor([1,2,3]) foo =...
