prompt endoftext bug - results with prompt > max-length are way better than prompts with length <= max-length
Describe the bug
Hi, I trained a personal model (key: smnb, class: person) with DreamBooth, and I found that I can't replicate the results between automatic1111 on the one side and huggingface/transformers + notebook on the other side.
The fascinating thing is that if my prompt is exactly max_words + 1 tokens long, the prompt is truncated: the extra word is cut off, but so is the endoftext token, and this creates fantastic results.
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, DDIMScheduler
from IPython.display import display

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
safety_checker = None
# model_path is defined earlier in the notebook and points to the trained DreamBooth checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, revision="fp16", scheduler=scheduler, safety_checker=safety_checker).to("cuda")
g_cuda = None

# %%
#@title Run for generating images.
g_cuda = torch.Generator(device='cuda')
seed = 1117437330 #@param {type:"number"}
g_cuda.manual_seed(seed)
prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera" #@param {type:"string"}
negative_prompt = "" #@param {type:"string"}
num_samples = 1 #@param {type:"number"}
guidance_scale = 10 #@param {type:"number"}
num_inference_steps = 50 #@param {type:"number"}
height = 512 #@param {type:"number"}
width = 512 #@param {type:"number"}

with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_samples,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=g_cuda
    ).images

for img in images:
    display(img)
This runs fine and produces this image:
But if I add a comma to the prompt and rerun the code (of course with the same seed), I get this warning but a very nice picture:
So the exact prompt is: prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera,"
-> ("," is the last character)
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['<|endoftext|>']
(it's even so similar to the real person that I hide the eyes here 😃)
If you think this is by accident, let's compare another seed, seed = 1117437320. With the comma prompt (and the endoftext token cut off?):
Without the comma (no cut-off, prompt length OK):
Seed 1117437334, with the comma prompt:
Without the comma (no cut-off, prompt length OK):
This goes on and on. Of course it is not the "," itself that matters; I think the forward pass gives better results when the endoftext token is cut off. Can someone confirm these results with other trained models?
I compared the prompts across 20 different seeds. Of course this is not a very big "study", but in my case 6-7 of the 20 results in the with-comma edition were really good and showed high similarity to the original picture/person.
In the non-comma edition, I would say that none of the 20 pictures looked like the original.
Maybe related issue: https://github.com/facebookresearch/SLIP/issues/18
Related code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L278
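For context, here is a minimal sketch (the tokenizer name and prompt are illustrative, not from the issue) of why a plain slice at that point drops the <|endoftext|> token, while the tokenizer's own truncation keeps it:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
max_len = tokenizer.model_max_length  # 77 for CLIP

long_prompt = "a, " * 100  # guaranteed to exceed 77 tokens
ids = tokenizer(long_prompt).input_ids

sliced = ids[:max_len]  # naive slice, as in the linked pipeline code
truncated = tokenizer(long_prompt, truncation=True, max_length=max_len).input_ids

print(sliced[-1] == tokenizer.eos_token_id)     # False: the EOS was cut off
print(truncated[-1] == tokenizer.eos_token_id)  # True: EOS stays the last token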
Reproduction
- Train a "keyword person" model with DreamBooth.
- Create images with a prompt of maximal length + 1 character (",").
- Compare to a prompt of max length or less (a reproduction sketch follows below).
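A hedged reproduction sketch (model_path and base_prompt are placeholders; any DreamBooth-trained checkpoint should do), rendering the same seed with and without the trailing comma:

import torch
from diffusers import StableDiffusionPipeline

model_path = "path/to/your/dreambooth/model"  # placeholder
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

base_prompt = "..."  # placeholder: a prompt that tokenizes to exactly 77 tokens
for name, p in [("without_comma", base_prompt), ("with_comma", base_prompt + ",")]:
    generator = torch.Generator(device="cuda").manual_seed(1117437330)
    image = pipe(p, generator=generator).images[0]
    image.save(f"{name}.png")  # compare the two outputs side by side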
You can also append additional tokens, which also get cut off. I added ",a,b,c":
prompt = "[identical from above] carne griffiths,a,b,c"
which resulted in this message (and again the great output/pictures):
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['a, b, c <|endoftext|>']
In any case, this is not the same as setting the prompt to:
prompt = "[identical from above] carne griffiths"
which is what most people would assume; instead, the cut-off also removes the endoftext token (I assume).
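A small sketch (with a hypothetical short prompt standing in for the long one above) of the id-level difference the text encoder actually sees:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

base = "photo of a smnb person"  # stand-in for the long prompt above
longer = base + ", a, b, c"

ids_base = tokenizer(base).input_ids                   # [BOS, ..., EOS]
ids_cut = tokenizer(longer).input_ids[:len(ids_base)]  # sliced to the same length

print(ids_base[-1])  # 49407, the <|endoftext|> id
print(ids_cut[-1])   # an ordinary content-token id, not 49407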
I reproduced this locally with the packages below on Windows + conda, and I can also reproduce it exactly on Google Colab. Perfect replication is only possible for me because I have my trained model, but if this is a bug, you could reproduce it by training on pictures of a real person like I did.
Logs
No response
System Info
important package list:
diffusers==0.6.0
torch==1.12.1
torch-fidelity==0.3.0
torchaudio==0.12.1
torchmetrics==0.6.0
torchvision==0.13.1
transformers==4.18.0
python==3.8.10
Top GitHub Comments
1.) Very good find! The cut-off indeed does not behave as it should. We should always add an EOS token to the end. Will open a PR to fix this!
2.) Interesting, not sure if this means we should change anything.
You can also use the default pipeline, so this is a general bug in the text cut-off:
Using prompt1
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths,"
should produce the exact same result as prompt2:
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths"
Because the "," is one token too many, the code cuts off the "," AND the EOT flag, so the results are not the same/deterministic as they should be.
["token73, token74, token75"]: ok -> + EOT flag
["token73, token74, token75,"]: too many tokens -> without EOT flag
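A minimal sketch of the fix described in 1.) (illustrative only, assuming the transformers CLIPTokenizer; the actual PR may implement it differently, e.g. via truncation=True):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "a very long prompt " * 20  # any prompt longer than 77 tokens

ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids
if ids.shape[-1] > tokenizer.model_max_length:
    ids = ids[:, : tokenizer.model_max_length]  # the slice that drops the EOS
    ids[:, -1] = tokenizer.eos_token_id         # always re-append <|endoftext|>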
In summary, we found two properties: