
8 GB Insufficient to Train Image Size of 512?

See original GitHub issue

Hello again @lucidrains & StyleGANers!

I tried training with --network-capacity 10 --attn-layers 1 --batch-size 1 --gradient-accumulate-every 32 --image-size 512 on an Nvidia 2070 Super with 8 GB of GDDR6. Training runs for about 5000 iterations in just under 30 hours, then suddenly crashes with the error

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.53 GiB already allocated; 71.00 MiB free; 6.92 GiB reserved in total by PyTorch)

Tried resuming the training but got a similar error within 24 iterations:

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.54 GiB already allocated; 59.00 MiB free; 6.94 GiB reserved in total by PyTorch)

Using --fp16 does not seem to reduce GPU memory usage; it is slightly slower than training without it, and in my limited tries it tends to produce NaN errors.

  1. Is it possible to let PyTorch reserve more than 6.92 GiB of memory? The card has roughly 0.9 GiB of additional memory available: when PyTorch is not running, nvidia-smi reports only 1 MiB in use on this card.

  2. If not, what parameters do you suggest changing so that we can continue training the model with --image-size 512? Batch size is already 1, and network-capacity is already quite low compared to the default values.
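On question 1, note that the error message distinguishes "allocated" from "reserved": memory that PyTorch has reserved but not allocated is cached by its allocator and can be fragmented, so a 128 MiB request fails if no single cached block is that large. A small sketch of the arithmetic, plus one mitigation (the max_split_size_mb knob exists only in newer PyTorch releases, 1.10+, and the value 128 is a guess to tune, not a recommendation from this issue):

```python
import os

# Numbers taken from the error message above. Reserved-but-unallocated
# memory is cached by PyTorch's allocator and may be fragmented.
total_gib     = 7.80
reserved_gib  = 6.92
allocated_gib = 6.53
fragmented_gib = reserved_gib - allocated_gib
print(f"cached but unallocated: {fragmented_gib:.2f} GiB")  # 0.39 GiB

# On PyTorch 1.10+ the caching allocator can be told not to split
# blocks larger than this size (in MiB), which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

The environment variable must be set before CUDA is initialized, so in practice it goes in the shell or at the very top of the training script.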

Thank you!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

3 reactions
trufty commented, Dec 5, 2020

A little late, but wanted to drop in and say I’m running fine at 512x512 on a GTX 1080 with 8 GB of VRAM, using version 1.5.1 of this repository. Currently at 39k iterations. VRAM usage is holding steady at 7.5 GB for me, so it’s certainly cutting it close.

I’m using the following params: --name 21k-512-aug-fp16-1.5.1 --data ../stylegan2/dataset --image-size 512 --fp16 --aug-prob 0.3 --aug-types [translation,cutout,color] --top-k-training --calculate-fid-every 5000 --batch-size 3 --gradient-accumulate-every 8
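Note that these settings also change the effective batch size, which is the per-step batch times the gradient-accumulation count; the asker's configuration and this one differ in peak VRAM per step, not just in samples per optimizer update:

```python
# Effective batch size = per-step batch * gradient-accumulation steps.
# A larger per-step batch costs more VRAM; accumulation trades that for time.
asker_effective  = 1 * 32  # --batch-size 1 --gradient-accumulate-every 32
trufty_effective = 3 * 8   # --batch-size 3 --gradient-accumulate-every 8
print(asker_effective, trufty_effective)  # 32 24
```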

Also, fmap_max seems to use up a good chunk of memory; I set it lower in cases where I was seeing OOM errors.
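For intuition on why fmap_max matters: in StyleGAN-style networks the per-layer channel count typically grows with depth until it is clamped at fmap_max, so lowering the cap directly shrinks the largest (most memory-hungry) layers. A hypothetical sketch of that cap, with illustrative values rather than this repository's exact code:

```python
# Hypothetical: how an fmap_max-style cap bounds per-layer channel counts.
# network_capacity matches the asker's flag; num_layers is a rough guess
# (about log2(image_size) - 1 for a 512px generator).
network_capacity = 10
fmap_max = 512
num_layers = 8

channels = [min(network_capacity * (2 ** i), fmap_max) for i in range(num_layers)]
print(channels)  # deeper layers are clamped at fmap_max
```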

Hope it helps

1 reaction
GLivshits commented, Jun 22, 2021

I have found the same: training crashes at 5024 iterations. This is caused by the automatic enabling of an additional loss, namely path-length regularization (PL). So choose the batch size appropriately, keeping in mind that more memory will be consumed later. I had 7 GB total consumption before 5024 iterations and 4 GB more after 😃 (image size 128, batch size 12, gradient accumulation 4).
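The failure mode above can be sketched as a step-gated auxiliary loss: the regularizer only switches on after a warm-up threshold, adding extra forward/backward state (and therefore VRAM) mid-training. The threshold and names here are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical: an auxiliary loss that activates after a warm-up step,
# mirroring why memory usage can jump past iteration ~5000.
PL_START_STEP = 5000  # assumed threshold; the repo's actual trigger may differ

def active_losses(step):
    losses = ["generator", "discriminator"]
    if step > PL_START_STEP:
        # Path-length regularization needs an extra backward pass worth of state.
        losses.append("path_length")
    return losses

print(active_losses(4999))  # ['generator', 'discriminator']
print(active_losses(5024))  # ['generator', 'discriminator', 'path_length']
```

The practical takeaway matches the comment: size the batch so there is headroom for the later, larger memory footprint, not just the warm-up phase.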


Top Results From Across the Web

  • CUDA out of memory · Issue #39 · CompVis/stable-diffusion
    try using --W 256 --H 256 as part of you prompt. the default image size is 512x512, which may be the reason why...
  • Create ML 2.0 - Not enough disk sp… | Apple Developer Forums
    The training dataset is relatively big: around 200.000 images taking 20GB disk space, 800 classes. The images are all stored on an external...
  • win32diskimager not enough space - Forum - Raspberry Pi
    I have a 8gb sd card with raspian image. ... However when I try to burn same image back to same sd card,...
  • Efficient Training on a Single GPU - Hugging Face
    In total we get 512 sequences each with length 512 and store them in a Dataset ... We can see that the model...
  • Does the input image size affect CNN's performance ... - Quora
    Resize your image from 512 x 512 to 224 x 224. Of course, this causes loss of information, but that happens anyway in...
