Unexpectedly high fp16 memory usage
I've been noticing that enabling fp16 makes very little difference in model size or memory usage. Using the script below (taken directly from your docs) and only changing the flag `fp16 = True` to `fp16 = False` yields a difference of about 4% in VRAM usage and exactly the same checkpoint size for both runs.

That seems suspiciously small compared to other projects I've used with fp16 enabled. A few people on the LAION Discord Imagen channel are noticing the same thing, although others report a bigger difference.

I'm wondering whether it comes down to differences in training scripts, since we all seem to be using our own custom variations.
```python
import torch
from imagen_pytorch import Unet, Imagen, SRUnet256, ImagenTrainer

# base unet
unet1 = Unet(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True),
    layer_cross_attns = False,
    use_linear_attn = True
)

# super-resolution unet
unet2 = SRUnet256(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = (2, 4, 8),
    layer_attns = (False, False, True),
    layer_cross_attns = False
)

imagen = Imagen(
    condition_on_text = False,
    unets = (unet1, unet2),
    image_sizes = (64, 128),
    timesteps = 1000
)

trainer = ImagenTrainer(
    imagen,
    fp16 = False  # change this and compare model sizes / memory usage
).cuda()

training_images = torch.randn(4, 3, 256, 256).cuda()

# train the first unet for a few steps, then save a checkpoint
for i in range(100):
    loss = trainer(training_images, unet_number = 1)
    trainer.update(unet_number = 1)

trainer.save("./checkpoint.pt")
```
Top GitHub Comments
Yeah, if it's working for you, it has to be a local env issue… ugh. Thanks for helping so far. (And yes, I had the fp16 flag set correctly.)
Just to confirm I'm not going crazy, I interrogated the `fp16 = True` model from the script above, and the dtype of all its layers is float32 😢
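In case it helps to reproduce, this is roughly how I checked. It's a minimal sketch against the script above; the only assumption is that the `imagen` object built there still holds the parameters the trainer is using:

```python
from collections import Counter

# Tally the parameter dtypes of the model built in the script above.
# With real fp16 weights I would expect torch.float16 to show up,
# but everything reports torch.float32.
dtype_counts = Counter(p.dtype for p in imagen.parameters())
print(dtype_counts)
```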