
Increase in GPU memory usage with Pytorch-Lightning

See original GitHub issue

Over the last week I have been porting my monocular depth estimation code to PyTorch Lightning, and everything is working perfectly. However, my models seem to require more GPU memory than before, to the point where I need to significantly decrease the batch size at training time. These are the Trainer parameters I am using, together with the relevant versions:

# Environment (Dockerfile excerpt)
FROM nvidia/cuda:10.1-devel-ubuntu18.04
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV PYTORCH_LIGHTNING_VERSION=0.7.1

# Trainer-related configuration
cfg.arch.gpus = 8
cfg.arch.num_nodes = 1
cfg.arch.num_workers = 8
cfg.arch.distributed_backend = 'ddp'
cfg.arch.amp_level = 'O0'
cfg.arch.precision = 32
cfg.arch.benchmark = True
cfg.arch.min_epochs = 1
cfg.arch.max_epochs = 50
cfg.arch.checkpoint_callback = False
cfg.arch.callbacks = []
cfg.arch.gradient_clip_val = 0.0
cfg.arch.accumulate_grad_batches = 1
cfg.arch.val_check_interval = 1.0
cfg.arch.check_val_every_n_epoch = 1
cfg.arch.num_sanity_val_steps = 0
cfg.arch.progress_bar_refresh_rate = 1
cfg.arch.fast_dev_run = False
cfg.arch.overfit_pct = 0.0
cfg.arch.train_percent_check = 1.0
cfg.arch.val_percent_check = 1.0
cfg.arch.test_percent_check = 1.0
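
For context, a minimal sketch of how the values above would typically be handed to the Trainer in PyTorch Lightning 0.7.1 is shown below. This is an assumption for illustration, not the original training script: only the keyword names and values are taken from the configuration listed above, and num_workers belongs to the DataLoader rather than the Trainer.

# Sketch only: the cfg.arch fields above mapped onto Trainer keyword arguments.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=8,
    num_nodes=1,
    distributed_backend='ddp',
    amp_level='O0',
    precision=32,
    benchmark=True,
    min_epochs=1,
    max_epochs=50,
    checkpoint_callback=False,
    callbacks=[],
    gradient_clip_val=0.0,
    accumulate_grad_batches=1,
    val_check_interval=1.0,
    check_val_every_n_epoch=1,
    num_sanity_val_steps=0,
    progress_bar_refresh_rate=1,
    fast_dev_run=False,
    overfit_pct=0.0,
    train_percent_check=1.0,
    val_percent_check=1.0,
    test_percent_check=1.0,
)
# trainer.fit(model)  # `model` being the monocular depth LightningModule;
#                     # num_workers=8 would be set on its DataLoaders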

Probably because of this, I am having trouble replicating my previous results. Could you please advise on possible solutions? I will open-source the code as soon as I manage to replicate my current results.
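
One generic way to narrow down where the extra memory goes, independent of any Lightning-specific tooling, is to log the CUDA allocator's statistics after each step in both the original training loop and the Lightning training_step. The sketch below is illustrative only and not code from the issue; the helper name is made up, while the torch.cuda calls are standard APIs available in PyTorch 1.4.

# Sketch: log current and peak GPU memory per step using torch.cuda allocator stats.
import torch

def log_step_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2   # tensors currently held (MB)
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2    # peak since the last reset (MB)
    print(f"step {step}: allocated={allocated:.0f} MB, peak={peak:.0f} MB")
    torch.cuda.reset_max_memory_allocated(device)                 # start a fresh peak window

Comparing these numbers between the plain-PyTorch run and the Lightning run at the same batch size would show whether the gap appears immediately or grows over the course of training.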

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
williamFalcon commented, Apr 5, 2020

@jeremyjordan can we get that memory profiler? @vguizilini mind trying again from master?

0 reactions
VitorGuizilini commented, Apr 9, 2020

Following up on this issue, is there anything else I should provide to facilitate debugging?

Read more comments on GitHub >

Top Results From Across the Web

Why does pytorch lightning cause more GPU memory usage?
Assuming that my model uses 2G of GPU memory and every batch of data uses 3G of GPU memory, the training code will use 5G (2+3) of GPU memory...
Read more >
memory — PyTorch Lightning 1.8.5.post0 documentation
A dictionary in which the keys are device ids as integers and values are memory usage as integers in MB. Raises. FileNotFoundError –...
Read more >
Memory Usage Keep Increasing During Training - vision
Hi guys, I trained my model using pytorch lightning. At the beginning, GPU memory usage is only 22%. However, after 900 steps, GPU...
Read more >
7 Tips To Maximize PyTorch Performance
7 Tips To Maximize PyTorch Performance · Use workers in DataLoaders · Pin memory · Avoid CPU to GPU transfers or vice-versa ·...
Read more >
7 Tips To Maximize PyTorch Performance | by William Falcon
Warning: The downside is that your memory usage will also increase (source). Pin memory. You know how sometimes your GPU ...
Read more >
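
As a concrete illustration of the "use workers in DataLoaders" and "pin memory" tips quoted above, a typical loader setup looks like the sketch below. The dataset, image size, and batch size are placeholders, not values from the issue.

# Sketch: DataLoader with worker processes and pinned (page-locked) host memory.
# Pinned memory speeds up host-to-GPU copies but increases host RAM usage, as the warning above notes.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 192, 640))  # placeholder image tensors
loader = DataLoader(
    dataset,
    batch_size=4,       # placeholder batch size
    num_workers=8,      # separate worker processes for loading/augmentation
    pin_memory=True,    # page-locked buffers enable faster, asynchronous transfers
    shuffle=True,
)

for (batch,) in loader:
    batch = batch.cuda(non_blocking=True)  # non_blocking=True only helps with pinned memory
    break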
