8 GB Insufficient to Train Image Size of 512?
Hello again @lucidrains & StyleGANers!
I tried training using --network-capacity 10 --attn-layers 1 --batch-size 1 --gradient-accumulate-every 32 --image-size 512
on an Nvidia 2070 Super with 8 GB of GDDR6. The program trains for about 5000 iterations in just under 30 hours but then suddenly crashes with the error
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.53 GiB already allocated; 71.00 MiB free; 6.92 GiB reserved in total by PyTorch)
Tried resuming the training but got a similar error within 24 iterations:
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.54 GiB already allocated; 59.00 MiB free; 6.94 GiB reserved in total by PyTorch)
Using --fp16 does not seem to reduce GPU memory usage, appears slightly slower than without it, and tends to give a NaN error during my limited tries with it.
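For reference, a minimal sketch of how mixed precision is usually wired up in PyTorch with torch.cuda.amp — this is generic AMP usage, not necessarily how --fp16 is implemented in this repo; the GradScaler is the piece that normally keeps fp16 from blowing up into NaNs:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Generic mixed-precision training step (assumed model/optimizer/loss_fn,
# not this repository's actual Trainer internals).
scaler = GradScaler()

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in fp16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()         # scale loss so fp16 gradients don't underflow
    scaler.step(optimizer)                # unscales grads; skips the step if inf/NaN found
    scaler.update()
    return loss.item()
```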
- Is it possible to let PyTorch reserve more than 6.92 GiB of memory? The card has an additional 0.9 GiB that PyTorch could use; when PyTorch is not running, nvidia-smi reports only 1 MB of memory usage on this card (see the inspection sketch below).
- If not, what parameters do you suggest changing so that we can continue training the model with --image-size 512? Batch size is already 1, and network-capacity is already quite low compared to the default value.
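On the first question: the "reserved" figure is PyTorch's caching allocator, not a hard cap, and the gap up to 7.8 GiB is typically held by the CUDA context and allocator fragmentation, so there is little headroom left to reclaim. A small sketch using standard torch.cuda calls (nothing specific to this repo) for checking where the memory actually goes:

```python
import torch

def report_gpu_memory(device=0):
    # Allocated = tensors currently alive; reserved = what the caching
    # allocator has grabbed from the driver (the "6.92 GiB reserved" figure).
    alloc = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    total = torch.cuda.get_device_properties(device).total_memory / 2**30
    print(f"allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | total {total:.2f} GiB")
    # Detailed breakdown of the caching allocator, including fragmentation:
    print(torch.cuda.memory_summary(device, abbreviated=True))
```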
Thank you!
Top GitHub Comments
A little late, but wanted to drop in and say I’m running fine at 512x512 with a GTX 1080 (8 GB VRAM) on version 1.5.1 of this repository. Currently on 39k iterations. The VRAM usage is holding steady at 7.5 GB for me, so it’s certainly cutting it close.
I’m using the following params:
--name 21k-512-aug-fp16-1.5.1 --data ../stylegan2/dataset --image-size 512 --fp16 --aug-prob 0.3 --aug-types [translation,cutout,color] --top-k-training --calculate-fid-every 5000 --batch-size 3 --gradient-accumulate-every 8
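With those flags the effective batch size is 3 × 8 = 24. For anyone tuning that pair, gradient accumulation trades wall-clock time for memory: several small forward/backward passes are folded into one optimizer step. A rough sketch of the general pattern (generic PyTorch, not this repo's Trainer code):

```python
def accumulated_step(model, optimizer, loss_fn, micro_batches):
    # One optimizer step built from several small forward/backward passes,
    # so peak activation memory is that of a single micro-batch.
    optimizer.zero_grad()
    n = len(micro_batches)
    for inputs, targets in micro_batches:
        loss = loss_fn(model(inputs), targets) / n  # average over the group
        loss.backward()                             # gradients accumulate in .grad
    optimizer.step()
```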
Also, fmap_max seems to use up a good chunk of memory. I set it lower in cases where I was seeing OOM errors. Hope it helps!
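For context, fmap_max caps the per-layer feature-map count, which otherwise keeps doubling toward the coarse end of the network; those capped layers are where most of the parameter and activation memory would otherwise live. A rough illustration of that kind of channel schedule (hypothetical numbers, not the repo's exact code):

```python
import math

def channel_schedule(image_size=512, network_capacity=16, fmap_max=512):
    # Number of resolution blocks between 4x4 and image_size.
    num_layers = int(math.log2(image_size)) - 1
    # Coarse layers want the most channels; fmap_max clamps them.
    return [min(network_capacity * (2 ** i), fmap_max)
            for i in range(num_layers, 0, -1)]

print(channel_schedule())  # -> [512, 512, 512, 512, 256, 128, 64, 32]
```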
I have found the same: training crashes at iteration 5024. This is caused by the automatic enabling of an additional loss at that point, namely path length regularization (PL). So you should choose the batch size appropriately, keeping in mind that more memory will be consumed later. I had 7 GB of total consumption before iteration 5024, and got 4 GB more after 😃 (image size 128, batch size 12, gradient accumulation 4).
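That lines up with PL regularization only being enabled after a warm-up period; the extra autograd graph it needs is what bumps memory. A rough sketch of a StyleGAN2-style PL penalty (assumed tensor shapes, not this repo's exact implementation):

```python
import torch

def path_length_penalty(fake_images, w_styles, pl_mean, decay=0.99):
    # fake_images: (batch, 3, H, W); w_styles: (batch, num_layers, latent_dim),
    # generated with requires_grad so autograd can reach back to W.
    _, _, h, w = fake_images.shape
    noise = torch.randn_like(fake_images) / (h * w) ** 0.5
    # create_graph=True keeps a second-order graph alive -- this is the extra
    # memory that appears once PL regularization switches on.
    grad, = torch.autograd.grad(outputs=(fake_images * noise).sum(),
                                inputs=w_styles, create_graph=True)
    path_lengths = grad.pow(2).sum(dim=2).mean(dim=1).sqrt()
    new_pl_mean = decay * pl_mean + (1 - decay) * path_lengths.mean().detach()
    penalty = (path_lengths - new_pl_mean).pow(2).mean()
    return penalty, new_pl_mean
```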