Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might look while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Out of memory during training

See original GitHub issue

I am running out of memory on every epoch: I have merged the A4 and TED datasets and am trying to train on the merged set, but I get an out-of-memory error on every epoch:

Epoch: [4][13/4336]	Time 0.538 (0.680)	Data 0.003 (0.003)	Loss 58.5850 (69.9133)	
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 304, in <module>
    loss.backward()
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCStorage.cu:58

Is there any way to set a maximum per-process GPU memory limit in PyTorch, similar to TF's:

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90

Fortunately, I am able to resume from checkpoints. This seems related to issue #172.
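
A note on the question above: the TF-style per_process_gpu_memory_fraction knob had no direct PyTorch equivalent at the time, but newer releases (PyTorch 1.8+) expose torch.cuda.set_per_process_memory_fraction, which caps the caching allocator instead of pre-reserving memory. A minimal sketch, assuming a recent PyTorch build:

import torch

# Cap the caching allocator at ~90% of the device's total memory (PyTorch 1.8+).
# Unlike TF, this does not reserve memory up front; allocations beyond the cap
# simply raise an out-of-memory error.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.9, device=0)

    # Releasing cached blocks between epochs can also relieve fragmentation.
    torch.cuda.empty_cache()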

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

8 reactions
oguzelibol commented, Jun 28, 2018

I’m having the same issue. It has nothing to do with the batch size; GPU memory keeps increasing regardless. I am using CUDA 8 and PyTorch 0.4.0 with Python 3.5. Has anyone figured out a solution to this?
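
A common cause of memory growing every iteration regardless of batch size (also pointed out in one of the web results below) is storing the loss tensor itself, which keeps the whole autograd graph alive. A minimal sketch of the fix, using a placeholder model and random data:

import torch
import torch.nn as nn

# Tiny stand-in model and data, just to illustrate the pattern.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

running_losses = []
for _ in range(100):
    inputs = torch.randn(32, 10, device="cuda")
    targets = torch.randn(32, 1, device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Appending `loss` itself would retain the graph of every iteration;
    # `loss.item()` stores only a plain Python float.
    running_losses.append(loss.item())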

1 reaction
zzvara commented, May 19, 2019

Same issue here on PyTorch 1.0.0 with the latest warp-ctc and the latest PyTorch audio. CUDA goes OOM irrespective of layer dimensions or batch size.
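
When OOM occurs irrespective of layer dimensions or batch size, logging allocator statistics at the end of each epoch can show whether live tensors are genuinely accumulating or the cache is just fragmented. A small diagnostic sketch (the helper name is illustrative, not from the thread):

import torch

def log_cuda_memory(tag):
    # memory_reserved() is PyTorch 1.4+; older versions call it memory_cached().
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"[{tag}] allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")

# If `allocated` climbs steadily across epochs, something is holding live tensors
# (e.g. stored losses or retained graphs); if only `reserved` is high, emptying
# the cache may relieve fragmentation.
log_cuda_memory("epoch end")
torch.cuda.empty_cache()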

Read more comments on GitHub >

Top Results From Across the Web

CUDA out of memory during training - PyTorch Forums
Hello, I am pretty new to machine learning and I am facing an issue I cannot solve by myself. I took this code...

Cuda out of memory during evaluation but training is fine
Hi, I am finetuning a BARTForConditionalGeneration model. I am using Trainer from the library to train so I do not use anything fancy...

Resolving CUDA Being Out of Memory With Gradient ...
Implementing gradient accumulation and automatic mixed precision to solve CUDA out of memory issues when training big deep learning models... (a brief sketch of both techniques follows this list)

Runtime error: CUDA out of memory by the end of training and ...
The problem is your loss_train list, which stores all losses from the beginning of your experiment. If the losses you put in were...

Out of memory during training - Jetson Nano
I am following the “Hello AI world” of Nvidia on my new Jetson-Nano dev kit (4GB). In the 3rd video (here), ...
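
The gradient accumulation and mixed-precision techniques mentioned above can be combined to cut activation memory while keeping the effective batch size. A minimal sketch, assuming PyTorch 1.6+ and an AMP-capable GPU (model and data are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

scaler = torch.cuda.amp.GradScaler()   # PyTorch 1.6+
accum_steps = 4                        # effective batch = 4 x micro-batch

optimizer.zero_grad()
for step in range(100):
    inputs = torch.randn(8, 10, device="cuda")   # small micro-batch
    targets = torch.randn(8, 1, device="cuda")

    # Autocast runs the forward pass in fp16 where safe, roughly halving
    # activation memory on supported GPUs.
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets) / accum_steps

    scaler.scale(loss).backward()      # gradients accumulate across micro-batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()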
