Unexpectedly high fp16 memory usage
I've been noticing that enabling fp16 makes very little difference in model size or memory usage. Using the script below (taken directly from your docs) and only changing the flag `fp16 = True` to `fp16 = False` yields a difference of about 4% in VRAM usage and exactly the same checkpoint size for both runs.

That seems suspiciously small compared to other projects I've used with fp16 enabled. A few people on the LAION Discord Imagen channel are noticing the same thing, although others report a bigger difference.

I'm wondering whether it comes down to differences in training scripts, since we all seem to be using our own custom variations.
```python
import torch
from imagen_pytorch import Unet, Imagen, SRUnet256, ImagenTrainer

# base unet
unet1 = Unet(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True),
    layer_cross_attns = False,
    use_linear_attn = True
)

# super-resolution unet
unet2 = SRUnet256(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = (2, 4, 8),
    layer_attns = (False, False, True),
    layer_cross_attns = False
)

imagen = Imagen(
    condition_on_text = False,
    unets = (unet1, unet2),
    image_sizes = (64, 128),
    timesteps = 1000
)

trainer = ImagenTrainer(
    imagen,
    fp16 = False  # change this and compare model sizes / memory usage
).cuda()

training_images = torch.randn(4, 3, 256, 256).cuda()

# train the first unet for a few steps, then save a checkpoint
for i in range(100):
    loss = trainer(training_images, unet_number = 1)
    trainer.update(unet_number = 1)

trainer.save("./checkpoint.pt")
```
Top GitHub Comments
Yeah, if it's working for you, it has to be a local env issue… ugh. Thanks for helping so far. (And yes, I had the fp16 flag set correctly.)
Just to confirm I'm not going crazy, I interrogated the `fp16 = True` model from the script above, and the dtype of all its layers is float32 😢
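In case it helps to reproduce, this is roughly how I checked. It's a minimal sketch against the script above; the only assumption is that the `imagen` object built there still holds the parameters the trainer is using:

```python
from collections import Counter

# Tally the parameter dtypes of the model built in the script above.
# With real fp16 weights I would expect torch.float16 to show up,
# but everything reports torch.float32.
dtype_counts = Counter(p.dtype for p in imagen.parameters())
print(dtype_counts)
```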