prompt endoftext bug - results with prompt > max-length are way better than prompts with length <= max-length
Describe the bug
Hi, I trained a personal model (key: smnb, class: person) with DreamBooth, and I found that I can't replicate the results between automatic1111 on the one side and huggingface/transformers + notebook on the other side.
The fascinating thing is that if my prompt is exactly max_words + 1 tokens long, the prompt is truncated: the extra word is cut off, but so is the endoftext token, and this creates fantastic results.
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, DDIMScheduler
from IPython.display import display

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
safety_checker = None
# model_path is defined earlier in the notebook and points to the trained DreamBooth checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, revision="fp16", scheduler=scheduler, safety_checker=safety_checker).to("cuda")
g_cuda = None

# %%
#@title Run for generating images.
g_cuda = torch.Generator(device='cuda')
seed = 1117437330 #@param {type:"number"}
g_cuda.manual_seed(seed)
prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera" #@param {type:"string"}
negative_prompt = "" #@param {type:"string"}
num_samples = 1 #@param {type:"number"}
guidance_scale = 10 #@param {type:"number"}
num_inference_steps = 50 #@param {type:"number"}
height = 512 #@param {type:"number"}
width = 512 #@param {type:"number"}

with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_samples,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=g_cuda
    ).images

for img in images:
    display(img)
This runs fine and produces this image:
But if I add a comma to the prompt and rerun the code (of course with the same seed), I get this warning but a very nice picture:
So the exact prompt is: prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera,"
-> ("," is the last character)
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['<|endoftext|>']
(it's even so similar to the real person that I hide the eyes here 😃)
If you think this is by accident, let's compare another seed, seed = 1117437320. With the comma prompt (and the endoftext token cut off?):
Without the comma (no cut-off, prompt length OK):
Seed 1117437334, with the comma prompt:
Without the comma (no cut-off, prompt length OK):
This goes on and on. Of course it is not the "," itself that matters; I think the forward pass gives better results when the endoftext token is cut off. Can someone confirm these results with other trained models?
I compared the prompts across 20 different seeds. Of course this is not a very big "study", but in my case 6-7 of the 20 results in the with-comma edition were really good and showed high similarity to the original picture/person.
In the non-comma edition, I would say that none of the 20 pictures looked like the original.
Maybe related issue: https://github.com/facebookresearch/SLIP/issues/18
Related code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L278
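For context, here is a minimal sketch (the tokenizer name and prompt are illustrative, not from the issue) of why a plain slice at that point drops the <|endoftext|> token, while the tokenizer's own truncation keeps it:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
max_len = tokenizer.model_max_length  # 77 for CLIP

long_prompt = "a, " * 100  # guaranteed to exceed 77 tokens
ids = tokenizer(long_prompt).input_ids

sliced = ids[:max_len]  # naive slice, as in the linked pipeline code
truncated = tokenizer(long_prompt, truncation=True, max_length=max_len).input_ids

print(sliced[-1] == tokenizer.eos_token_id)     # False: the EOS was cut off
print(truncated[-1] == tokenizer.eos_token_id)  # True: EOS stays the last token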
Reproduction
- Train a "keyword person" model with DreamBooth.
- Create images with a prompt of maximal length + 1 character (",").
- Compare to a prompt of max length or less (a reproduction sketch follows below).
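A hedged reproduction sketch (model_path and base_prompt are placeholders; any DreamBooth-trained checkpoint should do), rendering the same seed with and without the trailing comma:

import torch
from diffusers import StableDiffusionPipeline

model_path = "path/to/your/dreambooth/model"  # placeholder
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

base_prompt = "..."  # placeholder: a prompt that tokenizes to exactly 77 tokens
for name, p in [("without_comma", base_prompt), ("with_comma", base_prompt + ",")]:
    generator = torch.Generator(device="cuda").manual_seed(1117437330)
    image = pipe(p, generator=generator).images[0]
    image.save(f"{name}.png")  # compare the two outputs side by side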
You can also append additional tokens, which also get cut off. I added ",a,b,c":
prompt = "[identical from above] carne griffiths,a,b,c"
which resulted in this message (and again the great output/pictures):
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['a, b, c <|endoftext|>']
In any case, this is not the same as setting the prompt to:
prompt = "[identical from above] carne griffiths"
which is what most people would assume; instead, the cut-off also removes the endoftext token (I assume).
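A small sketch (with a hypothetical short prompt standing in for the long one above) of the id-level difference the text encoder actually sees:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

base = "photo of a smnb person"  # stand-in for the long prompt above
longer = base + ", a, b, c"

ids_base = tokenizer(base).input_ids                   # [BOS, ..., EOS]
ids_cut = tokenizer(longer).input_ids[:len(ids_base)]  # sliced to the same length

print(ids_base[-1])  # 49407, the <|endoftext|> id
print(ids_cut[-1])   # an ordinary content-token id, not 49407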
I reproduced this locally with the packages below on Windows + conda, and I can also reproduce it exactly on Google Colab. Perfect replication is only possible for me because I have my trained model, but if this is a bug, you could reproduce it by training on pictures of a real person like I did.
Logs
No response
System Info
important package list:
diffusers==0.6.0
torch==1.12.1
torch-fidelity==0.3.0
torchaudio==0.12.1
torchmetrics==0.6.0
torchvision==0.13.1
transformers==4.18.0
python==3.8.10
Top GitHub Comments
1.) Very good find! The cut-off indeed does not behave as it should. We should always add an EOS token to the end. Will open a PR to fix this!
2.) Interesting, not sure if this means we should change anything.
You can also use the default pipeline, so this is a general bug in the text cut-off:
Using prompt1
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths,"
should produce the exact same result as prompt2:
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths"
Because the "," is one token too many, the code cuts off the "," AND the EOT flag, so the results are not the same/deterministic as they should be.
["token73, token74, token75"]: ok -> + EOT flag
["token73, token74, token75,"]: too many tokens -> without EOT flag
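A minimal sketch of the fix described in 1.) (illustrative only, assuming the transformers CLIPTokenizer; the actual PR may implement it differently, e.g. via truncation=True):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompt = "a very long prompt " * 20  # any prompt longer than 77 tokens

ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids
if ids.shape[-1] > tokenizer.model_max_length:
    ids = ids[:, : tokenizer.model_max_length]  # the slice that drops the EOS
    ids[:, -1] = tokenizer.eos_token_id         # always re-append <|endoftext|>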
In summary, we found two properties: