
Out of memory at a later stage of training

See original GitHub issue

Hello,

I observed some strange behavior when launching a training run on a server with 4 × 16 GB P100 GPUs using:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco

The training went well for 12 epochs, and then in the middle of the 13th epoch it hit an OOM error. Usually memory usage shouldn’t change between epochs, but I don’t know whether that holds for DETR.
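
(One way to check whether the peak actually grows from epoch to epoch is to log PyTorch’s peak-memory counter and reset it before every epoch. A minimal sketch, not DETR’s actual code; the callable passed in is a placeholder for whatever runs one training epoch:)

import torch

def report_epoch_peak(epoch, run_one_epoch):
    """Run one training epoch and print the peak CUDA memory it needed."""
    torch.cuda.reset_peak_memory_stats()   # reset the peak counter for this epoch
    run_one_epoch()                        # placeholder for the real training step
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"epoch {epoch}: peak GPU memory {peak_gb:.2f} GB")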

According to the paper, you trained your models using “16 V100 GPUs, with 4 images per GPU (hence a total batch size of 64)”. Could you tell me whether your GPUs have 16 GB or 32 GB of memory?

Thanks a lot!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
lessw2020 commented, Jul 23, 2020

If it helps, I’ve monitored memory quite a bit while training. You can see a bit of a climb during the first epoch, and after that it’s largely stable within +/- 0.1 GB (e.g. 10.1 GB). For the R50 model it’s around 10.x GB with batch size 2 and 11.x GB with batch size 4 in my training. Anyway, no memory issues in my experience. Note that you may want to run nvidia-smi (in a notebook: !nvidia-smi) and check your GPUs’ free memory before starting, to confirm what’s actually available.
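
(For reference, the same per-GPU numbers can be pulled programmatically; this is just a thin wrapper around standard nvidia-smi query flags, nothing DETR-specific:)

import subprocess

# Print used/free/total memory per GPU before launching training.
print(subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.free,memory.total",
     "--format=csv"],
    capture_output=True, text=True, check=True).stdout)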

1 reaction
alcinos commented, Jul 17, 2020

Hi @netw0rkf10w, thank you for your interest in DETR.

Memory usage variation mainly comes from padding. If, by chance, you get a wide horizontal image and a narrow vertical image in the same batch, the resulting padded images will be huge (and waste quite a bit of memory). The trainings are seeded, so presumably you will always encounter this “bad batch” at the same point in your 13th epoch.
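
(As a rough illustration of why such a batch is expensive, assuming DETR-style resizing that keeps the shortest side at up to 800 pixels and caps the longest at 1333: one wide and one tall image padded together can need roughly 1.7x the elements of the two images alone. The shapes below are made up for the example:)

import torch

# Hypothetical shapes after resizing (C, H, W).
wide = torch.zeros(3, 800, 1333)   # wide horizontal image
tall = torch.zeros(3, 1333, 800)   # narrow vertical image

# Batching pads every image to the per-dimension maximum: here 1333 x 1333.
max_h = max(wide.shape[1], tall.shape[1])
max_w = max(wide.shape[2], tall.shape[2])
batch = torch.zeros(2, 3, max_h, max_w)   # what actually sits in GPU memory

useful = wide.numel() + tall.numel()
print(f"padded / useful elements: {batch.numel() / useful:.2f}x")  # ~1.67x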

That being said, your command seems to be using the default batch size, which is 2 per card; that should fit amply on a 16 GB card. Could you double-check that no other process is using GPU memory on the node? The logs also report the current “max mem” that has been required, so you can check whether it is constantly flirting with 16 GB (it shouldn’t be with batch size 2).

Best of luck.
