Out of memory at later stage of training
Hello,
I observed some strange behavior when launching training on a server with four 16 GB P100 GPUs, using:
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco
The training went well for 12 epochs, and then in the middle of the 13th epoch it hit an OOM error. Usually memory usage shouldn't change between epochs, but I don't know whether that holds for DETR.
According to the paper, you trained your models using “16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)”. Could you tell me whether your GPUs have 16 GB or 32 GB of memory?
Thanks a lot!
Top GitHub Comments
If it helps, I’ve monitored memory quite a bit while training: you can see a bit of a climb during the first epoch, and after that it’s largely stable within ±0.1 GB (around 10.1 GB in my case). For the R50 model it’s around 10.x GB with batch size 2, and 11.x GB with batch size 4 in my training. Anyway, no memory issues in my experience. Note that you might want to run nvidia-smi (in a notebook: !nvidia-smi) and check your GPUs’ free memory before starting, to confirm what is actually available.
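If you’d rather check from Python than from the shell, here is a minimal sketch (assuming a recent PyTorch that provides torch.cuda.mem_get_info) that prints the free memory on each GPU before launching:

import torch

# Print free/total memory per GPU before training starts.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for a device.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")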
Hi @netw0rkf10w, thank you for your interest in DETR.
Memory usage variation mainly comes from padding. If by chance you get a wide horizontal image and a narrow vertical image in the same batch, the padded result will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this “bad batch” at the same point in your 13th epoch.
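For illustration, here is a simplified sketch (not DETR’s actual collate code; the 800/1333 sizes are just roughly what the default resizing can produce) of how padding two extreme aspect ratios inflates the batch tensor:

import torch

# Two images after resizing: one wide, one tall.
wide   = torch.rand(3,  800, 1333)
narrow = torch.rand(3, 1333,  800)

# The batch is padded to the max height and max width across the batch.
max_h = max(wide.shape[1], narrow.shape[1])   # 1333
max_w = max(wide.shape[2], narrow.shape[2])   # 1333
padded = torch.zeros(2, 3, max_h, max_w)
padded[0, :, :wide.shape[1],   :wide.shape[2]]   = wide
padded[1, :, :narrow.shape[1], :narrow.shape[2]] = narrow

useful = wide.numel() + narrow.numel()
print(f"padded batch holds {padded.numel() / useful:.2f}x the pixels actually used")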
That being said, your command seems to be using the default batch size, which is 2 per card; that should amply fit on a 16 GB card. Could you double-check that no other process is using GPU memory on the node? Also, the logs report the current “max mem” that has been required, so you can check whether it’s constantly flirting with 16 GB (it shouldn’t with batch size 2).
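As far as I remember, the “max mem” figure in the logs is the peak allocation reported by PyTorch’s CUDA allocator, which you can also query directly:

import torch

# Peak memory allocated on the current device since the process started
# (or since the last reset), in MB.
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak allocated: {peak_mb:.0f} MB")

# Optionally reset the counter to measure a single epoch or batch.
torch.cuda.reset_peak_memory_stats()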
Best of luck.