Resume training out of memory
See original GitHub issue

@dbolya I trained a custom model for many iterations and want to resume the training by adding more images. When I run
python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1
it gives me a CUDA out of memory error:
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 5.80 GiB total capacity; 3.63 GiB already allocated; 16.81 MiB free; 80.22 MiB cached)
I trained the model using the same machine and setup. Do you have any suggestions on this issue? Thank you.
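One workaround not suggested in the thread itself: if the checkpoint loads but training then runs out of memory, resuming with a smaller batch size usually gets past it. YOLACT's train.py accepts a --batch_size flag (per its README); the value 4 below is only an example:

python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1 --batch_size=4

If it still runs out of memory at batch size 1, something else on the GPU is likely holding the remaining VRAM.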
Issue Analytics
- Created 4 years ago
- Comments: 11 (6 by maintainers)
Top Results From Across the Web

Running out of memory when resume training. #12680 - GitHub
I will profile the memory usage and get back to you with the findings. I'm glad to hear that meanwhile you have a...

Resuming training raises error "CUDA out of memory" - vision
It seems that there's something wrong with the loss but I cannot understand what's the problem.

Resuming training fails with CUDA out of memory error
I am using my university's HPC cluster and there is a time limit per job. So I ran the train method of the...

Resuming pytorch model training raises error "CUDA out of ...
2 : out of memory Traceback (most recent call last): File "train.py" ; in <module> train(train_loader, model, optimizer, epoch) File "train.py" ...

Resume=True fails without useful error message - RLlib - Ray
High: It blocks me to complete my task. I'm so close to finishing my training (Ray 2.0.0) and I ran out of disk...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Wait that’s 10gb with a batch size of 2? Uhh that would be a problem.
I’m definitely not getting that big of a difference on my machines, but I’ll test to see if there’s any difference on my side.
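For comparing numbers like these across machines, one generic check (not something from this thread) is to watch per-GPU memory with nvidia-smi while training runs:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Adding -l 1 repeats the reading every second, which makes it easy to see whether usage creeps up over iterations.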
CUDA_VISIBLE_DEVICES sets which GPU to use in the case you have multiple GPUs. If you only have one GPU, it shouldn’t matter. So uh, this is weird.

My environment is:
Of the libraries that YOLACT uses, I have:
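Pinning a device with CUDA_VISIBLE_DEVICES, as mentioned above, is just an environment-variable prefix on the usual training command; the device index 0 here is only an illustration:

CUDA_VISIBLE_DEVICES=0 python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1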
Note that you might be able to squeeze out more VRAM by downgrading to torch==1.0.1 and whatever torchvision package is appropriate for that version. We noticed a jump in VRAM usage from 1.0.1 to 1.1.0.
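For reference, that downgrade is a single pip command; the torchvision pin below is an assumption (0.2.2 is the release commonly paired with torch 1.0.1), so verify it against the PyTorch compatibility table for your CUDA build:

pip install torch==1.0.1 torchvision==0.2.2

Older CUDA-specific wheels can also be pulled from the https://download.pytorch.org/whl/torch_stable.html index with pip's -f flag.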