Resume training out of memory
See original GitHub issue

@dbolya I trained a custom model for many iterations and want to resume the training by adding more images. When I run
python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1
it gives me a CUDA out of memory error:
RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 5.80 GiB total capacity; 3.63 GiB already allocated; 16.81 MiB free; 80.22 MiB cached)
I trained the model using the same machine and setup. Do you have any suggestions on this issue? Thank you.
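One workaround not suggested in the thread itself: if the checkpoint loads but training then runs out of memory, resuming with a smaller batch size usually gets past it. YOLACT's train.py accepts a --batch_size flag (per its README); the value 4 below is only an example:

python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1 --batch_size=4

If it still runs out of memory at batch size 1, something else on the GPU is likely holding the remaining VRAM.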
Issue Analytics
- Created 4 years ago
- Comments: 11 (6 by maintainers)
Top Results From Across the Web

Running out of memory when resume training. #12680 - GitHub
I will profile the memory usage and get back to you with the findings. I'm glad to hear that meanwhile you have a...

Resuming training raises error "CUDA out of memory" - vision
It seems that there's something wrong with the loss but I cannot understand what's the problem.

Resuming training fails with CUDA out of memory error
I am using my university's HPC cluster and there is a time limit per job. So I ran the train method of the...

Resuming pytorch model training raises error "CUDA out of ...
2 : out of memory Traceback (most recent call last): File "train.py" ; in <module> train(train_loader, model, optimizer, epoch) File "train.py" ...

Resume=True fails without useful error message - RLlib - Ray
High: It blocks me to complete my task. I'm so close to finishing my training (Ray 2.0.0) and I ran out of disk...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Wait that’s 10gb with a batch size of 2? Uhh that would be a problem.
I’m definitely not getting that big of a difference on my machines, but I’ll test to see if there’s any difference on my side.
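For comparing numbers like these across machines, one generic check (not something from this thread) is to watch per-GPU memory with nvidia-smi while training runs:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv

Adding -l 1 repeats the reading every second, which makes it easy to see whether usage creeps up over iterations.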
CUDA_VISIBLE_DEVICES sets which GPU to use in the case you have multiple GPUs. If you only have one GPU, it shouldn’t matter. So uh, this is weird.

My environment is:
Of the libraries that YOLACT uses, I have:
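Pinning a device with CUDA_VISIBLE_DEVICES, as mentioned above, is just an environment-variable prefix on the usual training command; the device index 0 here is only an illustration:

CUDA_VISIBLE_DEVICES=0 python train.py --config=yolact_base_config --resume=weights/yolact_base_3_345_interrupt.pth --start_iter=-1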
Note that you might be able to squeeze out more VRAM by downgrading to torch==1.0.1 and whatever torchvision package is appropriate for that version. We noticed a jump in VRAM usage from 1.0.1 to 1.1.0.
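For reference, that downgrade is a single pip command; the torchvision pin below is an assumption (0.2.2 is the release commonly paired with torch 1.0.1), so verify it against the PyTorch compatibility table for your CUDA build:

pip install torch==1.0.1 torchvision==0.2.2

Older CUDA-specific wheels can also be pulled from the https://download.pytorch.org/whl/torch_stable.html index with pip's -f flag.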