Out of memory at later stage of training
Hello,
I observed some strange behavior when launching training on a server with four 16 GB P100 GPUs, using:
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco
The training went well for 12 epochs, and then in the middle of the 13th epoch it hit an OOM error. Usually memory usage shouldn't change between epochs, but I don't know whether that holds for DETR.
According to the paper, you trained your models using “16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)”. Could you tell me whether your GPUs have 16 GB or 32 GB of memory?
Thanks a lot!
Top GitHub Comments
If it helps, I’ve monitored memory quite a bit while training: you can see a bit of a climb during the first epoch, and after that it’s largely stable within ±0.1 GB (around 10.1 GB in my case). For the R50 model it’s around 10.x GB with batch size 2, and 11.x GB with batch size 4 in my training. Anyway, no memory issues in my experience. Note that you might want to run nvidia-smi (in a notebook: !nvidia-smi) and check your GPUs’ free memory before starting, to confirm what is actually available.
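If you’d rather check from Python than from the shell, here is a minimal sketch (assuming a recent PyTorch that provides torch.cuda.mem_get_info) that prints the free memory on each GPU before launching:

import torch

# Print free/total memory per GPU before training starts.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for a device.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")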
Hi @netw0rkf10w, thank you for your interest in DETR.
Memory usage variation mainly comes from padding. If by chance you get a wide horizontal image and a narrow vertical image in the same batch, the padded result will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this “bad batch” at the same point in your 13th epoch.
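For illustration, here is a simplified sketch (not DETR’s actual collate code; the 800/1333 sizes are just roughly what the default resizing can produce) of how padding two extreme aspect ratios inflates the batch tensor:

import torch

# Two images after resizing: one wide, one tall.
wide   = torch.rand(3,  800, 1333)
narrow = torch.rand(3, 1333,  800)

# The batch is padded to the max height and max width across the batch.
max_h = max(wide.shape[1], narrow.shape[1])   # 1333
max_w = max(wide.shape[2], narrow.shape[2])   # 1333
padded = torch.zeros(2, 3, max_h, max_w)
padded[0, :, :wide.shape[1],   :wide.shape[2]]   = wide
padded[1, :, :narrow.shape[1], :narrow.shape[2]] = narrow

useful = wide.numel() + narrow.numel()
print(f"padded batch holds {padded.numel() / useful:.2f}x the pixels actually used")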
That being said, your command seems to be using the default batch size, which is 2 per card; that should amply fit on a 16 GB card. Could you double-check that no other process is using GPU memory on the node? Also, the logs report the current “max mem” that has been required, so you can check whether it’s constantly flirting with 16 GB (it shouldn’t with batch size 2).
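As far as I remember, the “max mem” figure in the logs is the peak allocation reported by PyTorch’s CUDA allocator, which you can also query directly:

import torch

# Peak memory allocated on the current device since the process started
# (or since the last reset), in MB.
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak allocated: {peak_mb:.0f} MB")

# Optionally reset the counter to measure a single epoch or batch.
torch.cuda.reset_peak_memory_stats()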
Best of luck.