
Resume training out of memory

See original GitHub issue

@dbolya I trained a custom model for many iterations and want to resume training after adding more images to the dataset. When I run

python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1

it gives me a CUDA out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 5.80 GiB total capacity; 3.63 GiB already allocated; 16.81 MiB free; 80.22 MiB cached)

I trained the model on the same machine with the same setup. Do you have any suggestions on this issue? Thank you.
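
Not from this thread, but one general PyTorch pitfall that makes resuming (rather than training from scratch) run out of memory is torch.load placing the checkpoint tensors directly on the GPU, so the restored weights and the freshly built model briefly coexist in VRAM. A minimal sketch of the usual workaround, assuming a standard PyTorch checkpoint (whether YOLACT’s own loader needs this is not established here):

import torch

# Load the interrupted checkpoint into host memory first; the weights only
# reach the GPU once the model itself is moved there.
state_dict = torch.load('weights/yolact_base_3_345_interrupt.pth', map_location='cpu')
# model.load_state_dict(state_dict)  # 'model' would be a hypothetical Yolact instance
# model.cuda()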

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
dbolya commented, Feb 5, 2020

Wait, that’s 10 GB with a batch size of 2? Uhh, that would be a problem.

I’m definitely not getting that big of a difference on my machines, but I’ll test to see if there’s any difference on my side.
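
A simple way to compare the two machines (just a sketch, not something posted in the thread) is to log PyTorch’s own peak-allocation counter after a fixed number of iterations; unlike nvidia-smi it excludes the caching allocator’s slack and memory held by other processes:

import torch

# Hypothetical helper: report the peak VRAM actually occupied by PyTorch tensors.
def report_peak_vram():
    peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
    print('peak allocated: %.0f MiB' % peak_mib)

# e.g. call report_peak_vram() after ~100 iterations on both setups and compare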

0 reactions
dbolya commented, Feb 6, 2020

CUDA_VISIBLE_DEVICES sets which GPU to use in case you have multiple GPUs. If you only have one GPU, it shouldn’t matter. So uh, this is weird.
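
For completeness, pinning the run to the first (or only) GPU just means prefixing the original command, e.g.:

CUDA_VISIBLE_DEVICES=0 python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1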

My environment is:

Ubuntu 18.04
Python 3.7.3
CUDA 10.1.243
cuDNN 7603

Of the libraries that YOLACT uses, I have:

Cython==0.29.13
matplotlib==3.1.1
numpy==1.17.2
opencv-python==4.1.1.26
Pillow==6.2.0
pycocotools==2.0.0
torch==1.3.0
torchvision==0.4.1

Note that you might be able to squeeze out more VRAM by downgrading to torch==1.0.1 and whatever torchvision package is appropriate for that version. We noticed a jump in VRAM usage from 1.0.1 to 1.1.0.
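
If you want to try that downgrade, a hedged sketch of the install step (the thread does not name the matching torchvision release, so it is left as a placeholder):

pip install torch==1.0.1 torchvision==<matching release>

After reinstalling, rerunning the same train.py command lets you compare peak VRAM between the two torch versions.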

Read more comments on GitHub >

Top Results From Across the Web

Running out of memory when resume training. #12680 - GitHub
I will profile the memory usage and get back to you with the findings. I'm glad to hear that meanwhile you have a...
Read more >
Resuming training raises error "CUDA out of memory" - vision
It seems that there's something wrong with the loss but I cannot understand what's the problem.
Read more >
Resuming training fails with CUDA out of memory error
I am using my university's HPC cluster and there is a time limit per job. So I ran the train method of the...
Read more >
Resuming pytorch model training raises error “CUDA out of ...
2 : out of memory Traceback (most recent call last): File "train.py" ; in <module> train(train_loader, model, optimizer, epoch) File "train.py" ...
Read more >
Resume=True fails without useful error message - RLlib - Ray
High: It blocks me to complete my task. I'm so close to finishing my training (Ray 2.0.0) and I ran out of disk...
Read more >
