8 GB Insufficient to Train Image Size of 512?
Hello again @lucidrains & StyleGANers!
I tried training using --network-capacity 10 --attn-layers 1 --batch-size 1 --gradient-accumulate-every 32 --image-size 512
on an Nvidia 2070 Super with 8 GB of GDDR6. The program trains for about 5000 iterations in just under 30 hours but then suddenly crashes with the error
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.53 GiB already allocated; 71.00 MiB free; 6.92 GiB reserved in total by PyTorch)
Tried resuming the training but got a similar error within 24 iterations:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.54 GiB already allocated; 59.00 MiB free; 6.94 GiB reserved in total by PyTorch)
Using --fp16 does not seem to reduce GPU memory usage, appears slightly slower than without it, and tends to give a NaN error during my limited tries with it.
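For reference, a minimal sketch of how mixed precision is usually wired up in PyTorch with torch.cuda.amp — this is generic AMP usage, not necessarily how --fp16 is implemented in this repo; the GradScaler is the piece that normally keeps fp16 from blowing up into NaNs:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Generic mixed-precision training step (assumed model/optimizer/loss_fn,
# not this repository's actual Trainer internals).
scaler = GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in fp16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()         # scale loss so fp16 gradients don't underflow
    scaler.step(optimizer)                # unscales grads; skips the step if inf/NaN found
    scaler.update()
    return loss.item()
```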
- Is it possible to let PyTorch reserve more than 6.92 GiB of memory? The card has an additional 0.9 GiB that PyTorch could use; when PyTorch is not running, nvidia-smi reports only 1 MB of memory usage on this card (see the inspection sketch below).
- If not, what parameters do you suggest changing so that we can continue training the model with --image-size 512? Batch size is already 1, and network-capacity is already quite low compared to the default value.
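On the first question: the "reserved" figure is PyTorch's caching allocator, not a hard cap, and the gap up to 7.8 GiB is typically held by the CUDA context and allocator fragmentation, so there is little headroom left to reclaim. A small sketch using standard torch.cuda calls (nothing specific to this repo) for checking where the memory actually goes:

```python
import torch

def report_gpu_memory(device=0):
    # Allocated = tensors currently alive; reserved = what the caching
    # allocator has grabbed from the driver (the "6.92 GiB reserved" figure).
    alloc = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | total {total:.2f} GiB")
    # Detailed breakdown of the caching allocator, including fragmentation:
    print(torch.cuda.memory_summary(device, abbreviated=True))
```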
Thank you!
Top GitHub Comments
A little late, but wanted to drop in and say I’m running fine at 512x512 with a GTX 1080 (8 GB VRAM) on version 1.5.1 of this repository. Currently on 39k iterations. The VRAM usage is holding steady at 7.5 GB for me, so it’s certainly cutting it close.
I’m using the following params:
--name 21k-512-aug-fp16-1.5.1 --data ../stylegan2/dataset --image-size 512 --fp16 --aug-prob 0.3 --aug-types [translation,cutout,color] --top-k-training --calculate-fid-every 5000 --batch-size 3 --gradient-accumulate-every 8
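With those flags the effective batch size is 3 × 8 = 24. For anyone tuning that pair, gradient accumulation trades wall-clock time for memory: several small forward/backward passes are folded into one optimizer step. A rough sketch of the general pattern (generic PyTorch, not this repo's Trainer code):

```python
def accumulated_step(model, optimizer, loss_fn, micro_batches):
    # One optimizer step built from several small forward/backward passes,
    # so peak activation memory is that of a single micro-batch.
    optimizer.zero_grad()
    n = len(micro_batches)
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets) / n  # average over the group
        loss.backward()                             # gradients accumulate in .grad
    optimizer.step()
```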
Also, fmap_max seems to use up a good chunk of memory. I set it lower in cases where I was seeing OOM errors. Hope it helps!
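For context, fmap_max caps the per-layer feature-map count, which otherwise keeps doubling toward the coarse end of the network; those capped layers are where most of the parameter and activation memory would otherwise live. A rough illustration of that kind of channel schedule (hypothetical numbers, not the repo's exact code):

```python
import math

def channel_schedule(image_size=512, network_capacity=16, fmap_max=512):
    # Number of resolution blocks between 4x4 and image_size.
    num_layers = int(math.log2(image_size)) - 1
    # Coarse layers want the most channels; fmap_max clamps them.
    return [min(network_capacity * (2 ** i), fmap_max)
            for i in range(num_layers, 0, -1)]

print(channel_schedule())  # -> [512, 512, 512, 512, 256, 128, 64, 32]
```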
I have found the same: training crashes at iteration 5024. This is caused by the automatic enabling of an additional loss at that point, namely path length regularization (PL). So you should choose the batch size appropriately, keeping in mind that more memory will be consumed later. I had 7 GB of total consumption before iteration 5024, and got 4 GB more after 😃 (image size 128, batch size 12, gradient accumulation 4).
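That lines up with PL regularization only being enabled after a warm-up period; the extra autograd graph it needs is what bumps memory. A rough sketch of a StyleGAN2-style PL penalty (assumed tensor shapes, not this repo's exact implementation):

```python
import torch

def path_length_penalty(fake_images, w_styles, pl_mean, decay=0.99):
    # fake_images: (batch, 3, H, W); w_styles: (batch, num_layers, latent_dim),
    # generated with requires_grad so autograd can reach back to W.
    _, _, h, w = fake_images.shape
    noise = torch.randn_like(fake_images) / (h * w) ** 0.5
    # create_graph=True keeps a second-order graph alive -- this is the extra
    # memory that appears once PL regularization switches on.
    grad, = torch.autograd.grad(outputs=(fake_images * noise).sum(),
                                inputs=w_styles, create_graph=True)
    path_lengths = grad.pow(2).sum(dim=2).mean(dim=1).sqrt()
    new_pl_mean = decay * pl_mean + (1 - decay) * path_lengths.mean().detach()
    penalty = (path_lengths - new_pl_mean).pow(2).mean()
    return penalty, new_pl_mean
```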