Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Out of memory during training

See original GitHub issue

I am running out of memory on every epoch: I have merged the A4 and TED datasets and am trying to train on the merged set, but I get an out-of-memory error on every epoch:

Epoch: [4][13/4336]	Time 0.538 (0.680)	Data 0.003 (0.003)	Loss 58.5850 (69.9133)	
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    loss.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

Is there any way to set a maximum per-process GPU memory limit in PyTorch, similar to TF's:

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90

Fortunately, I am able to resume from checkpoints. This seems related to issue #172.
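
A note on the question above: the TF-style per_process_gpu_memory_fraction knob had no direct PyTorch equivalent at the time, but newer releases (PyTorch 1.8+) expose torch.cuda.set_per_process_memory_fraction, which caps the caching allocator instead of pre-reserving memory. A minimal sketch, assuming a recent PyTorch build:

import torch

# Cap the caching allocator at ~90% of the device's total memory (PyTorch 1.8+).
# Unlike TF, this does not reserve memory up front; allocations beyond the cap
# simply raise an out-of-memory error.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.9, device=0)

    # Releasing cached blocks between epochs can also relieve fragmentation.
    torch.cuda.empty_cache()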

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

8 reactions
oguzelibol commented, Jun 28, 2018

I’m having the same issue. It has nothing to do with the batch size; GPU memory keeps increasing regardless. I am using CUDA 8 and PyTorch 0.4.0 with Python 3.5. Has anyone figured out a solution to this?
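
A common cause of memory growing every iteration regardless of batch size (also pointed out in one of the web results below) is storing the loss tensor itself, which keeps the whole autograd graph alive. A minimal sketch of the fix, using a placeholder model and random data:

import torch
import torch.nn as nn

# Tiny stand-in model and data, just to illustrate the pattern.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_losses = []
for _ in range(100):
    inputs = torch.randn(32, 10, device="cuda")
    targets = torch.randn(32, 1, device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Appending `loss` itself would retain the graph of every iteration;
    # `loss.item()` stores only a plain Python float.
    running_losses.append(loss.item())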

1 reaction
zzvara commented, May 19, 2019

Same issue here on PyTorch 1.0.0 with the latest warp-ctc and the latest PyTorch audio. CUDA goes OOM irrespective of layer dimensions or batch size.
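
When OOM occurs irrespective of layer dimensions or batch size, logging allocator statistics at the end of each epoch can show whether live tensors are genuinely accumulating or the cache is just fragmented. A small diagnostic sketch (the helper name is illustrative, not from the thread):

import torch

def log_cuda_memory(tag):
    # memory_reserved() is PyTorch 1.4+; older versions call it memory_cached().
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")

# If `allocated` climbs steadily across epochs, something is holding live tensors
# (e.g. stored losses or retained graphs); if only `reserved` is high, emptying
# the cache may relieve fragmentation.
log_cuda_memory("epoch end")
torch.cuda.empty_cache()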

Read more comments on GitHub >

Top Results From Across the Web

CUDA out of memory during training - PyTorch Forums
Hello, I am pretty new to machine learning and I am facing an issue I cannot solve by myself. I took this code...

Cuda out of memory during evaluation but training is fine
Hi, I am finetuning a BARTForConditionalGeneration model. I am using Trainer from the library to train so I do not use anything fancy...

Resolving CUDA Being Out of Memory With Gradient ...
Implementing gradient accumulation and automatic mixed precision to solve CUDA out of memory issues when training big deep learning models... (a brief sketch of both techniques follows this list)

Runtime error: CUDA out of memory by the end of training and ...
The problem is your loss_train list, which stores all losses from the beginning of your experiment. If the losses you put in were...

Out of memory during training - Jetson Nano
I am following the “Hello AI world” of Nvidia on my new Jetson-Nano dev kit (4GB). In the 3rd video (here), ...
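
The gradient accumulation and mixed-precision techniques mentioned above can be combined to cut activation memory while keeping the effective batch size. A minimal sketch, assuming PyTorch 1.6+ and an AMP-capable GPU (model and data are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scaler = torch.cuda.amp.GradScaler()   # PyTorch 1.6+
accum_steps = 4                        # effective batch = 4 x micro-batch

optimizer.zero_grad()
for step in range(100):
    inputs = torch.randn(8, 10, device="cuda")   # small micro-batch
    targets = torch.randn(8, 1, device="cuda")

    # Autocast runs the forward pass in fp16 where safe, roughly halving
    # activation memory on supported GPUs.
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets) / accum_steps

    scaler.scale(loss).backward()      # gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()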
