Explanation of the 0.18215 factor in textual_inversion?
Hi, just a small question about the quoted script above that is bothering me: where does this 0.18215 number come from? What computation is being done? Is it from some paper? I have seen the same factor elsewhere, too, without explanation. Any guidance would be very helpful, thanks!
Issue Analytics
- Created: a year ago
- Comments: 7 (1 by maintainers)
Top Results From Across the Web
- Explanation of the 0.18215 factor in textual_inversion? — "The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this ..."
- What does "0.18215" mean in blog Stable Diffusion with ... — "In the part Stable Diffusion with Diffusers, there is this line. What is '0.18215' and why should I do this? And the code..."
- Teach StableDiffusion new concepts via Textual Inversion — "This guide shows you how to fine-tune the StableDiffusion model shipped in KerasCV using the Textual-Inversion algorithm."
- I'm confused about how models work and textual inversion as ... — "Could you explain the embedding thing a bit? Does it work with multiple models, and do more images add more consistency?"
- HuggingFace Diffusers 0.2 : Stable Diffusion (Text - PyTorch — "Load the tokenizer and text encoder to tokenize and encode the text. ... scale and decode the image latents with vae latents =..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @garrett361 @patil-suraj @CodeExplode

We introduced the scale factor in the latent diffusion paper. The goal was to handle different latent spaces (from different autoencoders, which can be scaled quite differently from images) with similar noise schedules. The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this helps 😃

In case this is useful for others, I've written some code to replicate the computation of that magic value. It seems to be a reasonable estimation!
Output:
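The replication code and its output were not preserved in this page. As an illustration only, here is a minimal sketch of the kind of computation described above, using synthetic Gaussian data as a stand-in for real VAE latents; the original estimate would instead encode a batch of training images with the Stable Diffusion autoencoder. Note that 1 / 0.18215 ≈ 5.49, so the synthetic latents below are drawn with that standard deviation purely to mimic the scale of real latents.

```python
import numpy as np

# Stand-in for a batch of VAE latents. In the real computation these would
# come from vae.encode(images); here we just draw Gaussian noise whose
# standard deviation (~5.49) mimics the empirical scale of SD latents.
rng = np.random.default_rng(0)
latents = rng.normal(loc=0.0, scale=5.49, size=(64, 4, 64, 64))

# The scale factor is the reciprocal of the latents' standard deviation,
# chosen so that the scaled latents have approximately unit variance.
scale_factor = 1.0 / latents.std()
scaled = latents * scale_factor

print(f"estimated scale_factor: {scale_factor:.5f}")  # close to 0.18215
print(f"scaled latent std: {scaled.std():.4f}")       # close to 1.0
```

Multiplying latents by this factor before diffusion (and dividing by it before decoding) is exactly what the `1 / 0.18215` scaling in the textual-inversion script does.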