
prompt endoftext bug - results with prompt > max-length are way better than prompts with length <= max-length


Describe the bug

Hi, I trained a personal model (key: smnb, class: person) with DreamBooth, and I found that I can't replicate the results between AUTOMATIC1111 on the one side and huggingface/diffusers + a notebook on the other side.

The fascinating thing is that if I use exactly max_words + 1 tokens, my prompt is truncated by that extra word, but the endoftext token is cut off as well, and this creates fantastic results (see the sketch below).
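To illustrate what I mean, here is a minimal sketch (my own illustration, assuming the openai/clip-vit-large-patch14 tokenizer that Stable Diffusion v1 uses): a prompt that is one token over the 77-token limit loses its endoftext token when it is hard-sliced.

    # Minimal sketch of the truncation behaviour, assuming the CLIP tokenizer
    # checkpoint that Stable Diffusion v1 uses (openai/clip-vit-large-patch14).
    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    limit = tokenizer.model_max_length  # 77, including <|startoftext|> and <|endoftext|>

    fits = " ".join(["cat"] * 75)  # 75 content tokens -> 77 total with special tokens
    over = " ".join(["cat"] * 76)  # one token too many -> 78 total

    for prompt in (fits, over):
        ids = tokenizer(prompt).input_ids[:limit]  # the hard slice the pipeline applies
        print(len(ids), ids[-1] == tokenizer.eos_token_id)
    # prints "77 True" for the fitting prompt (it still ends in <|endoftext|>)
    # and "77 False" for the over-long one (the slice dropped <|endoftext|> too)

Anyway, here is the full setup I used: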

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, DDIMScheduler
from IPython.display import display

model_path = "path/to/dreambooth-model"  # placeholder: path to the trained DreamBooth model

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
safety_checker = None  # disable the safety checker for this experiment

pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, revision="fp16", scheduler=scheduler, safety_checker=safety_checker).to("cuda")

# %%
#@title Run for generating images.
g_cuda = torch.Generator(device="cuda")
seed = 1117437330 #@param {type:"number"}
g_cuda.manual_seed(seed)

prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera" #@param {type:"string"}
negative_prompt = "" #@param {type:"string"}
num_samples = 1 #@param {type:"number"}
guidance_scale = 10 #@param {type:"number"}
num_inference_steps = 50 #@param {type:"number"}
height = 512 #@param {type:"number"}
width = 512 #@param {type:"number"}

with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_samples,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=g_cuda
    ).images

for img in images:
    display(img)

This runs fine and produces this image: [image]

But if I add a comma to the prompt and rerun the code (of course with the same seed), I get this warning but a very nice picture.

So the exact prompt is: prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera,"

-> ("," is the last character)

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['<|endoftext|>']

[image]

(It's even so similar to the real person that I've hidden the eyes here 😃)

If you think this is by accident, let's compare another seed: seed = 1117437320, prompt with comma (and endoftext cut-off?): [image]

Without comma (no cut-off, prompt length OK): [image]

seed = 1117437334, with-comma prompt: [image]

Without comma (no cut-off, prompt length OK): [image]

This goes on and on. Of course it is not the "," itself that matters; I think the forward pass works better when the endoftext token is cut off. Can someone confirm these results with other trained models?

I compared the prompts across 20 different seeds. Of course this is not a very big "study", but in my case 6-7 of the 20 results in the with-comma edition were really good, with high similarity to the original picture/person.

In the non-comma edition, I would say that none of the 20 pictures looked like the original.

Maybe related issue: https://github.com/facebookresearch/SLIP/issues/18

related code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L278
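That code path (as of diffusers 0.6.0) roughly does the following; this is a paraphrased sketch with illustrative names reusing the tokenizer from above, not the verbatim source:

    # Paraphrased sketch of the pipeline's prompt encoding (not the verbatim source):
    text_input_ids = tokenizer(prompt, padding="max_length",
                               max_length=tokenizer.model_max_length,
                               return_tensors="pt").input_ids
    if text_input_ids.shape[-1] > tokenizer.model_max_length:
        removed = tokenizer.batch_decode(text_input_ids[:, tokenizer.model_max_length:])
        print(f"The following part of your input was truncated because CLIP can only "
              f"handle sequences up to {tokenizer.model_max_length} tokens: {removed}")
        # this hard slice is what also removes the trailing <|endoftext|>:
        text_input_ids = text_input_ids[:, : tokenizer.model_max_length]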

Reproduction

  1. Train a keyword-person model with DreamBooth.
  2. Generate images with a prompt of maximal length + 1 character (","); the sketch below shows one way to construct such a prompt.
  3. Compare the results to the same prompt at max length or less.
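For step 2, a small hypothetical helper (my own sketch; pad_to_limit and the filler word are illustrative, not part of any library) can pad an arbitrary prompt to exactly the limit and then push it one token over:

    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    limit = tokenizer.model_max_length  # 77

    def pad_to_limit(prompt: str, filler: str = " detailed") -> str:
        # append single-token filler words until the prompt tokenizes to exactly `limit` ids
        while len(tokenizer(prompt).input_ids) < limit:
            prompt += filler
        return prompt

    base = pad_to_limit("Ultrawide realistic photo of a smnb person")
    over = base + ","  # one extra character -> one token over the limit
    print(len(tokenizer(base).input_ids), len(tokenizer(over).input_ids))  # 77 78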

You can also add additional tokens, which also get cut off:

I appended ",a,b,c", i.e. prompt = "[identical from above] carne griffiths,a,b,c", which resulted in this message (and again the great output/pictures): The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['a, b, c <|endoftext|>']

In any case, this is not the same as setting the prompt to prompt = "[identical from above] carne griffiths", as most people would assume, because the cut-off removes the endoftext flag token as well (I assume).
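To make that difference concrete, here is a quick check (again my own sketch, using the tokenizer from above; long_prompt and short_prompt stand for the two prompt variants just described):

    # long_prompt ends in ",a,b,c"; short_prompt stops at "carne griffiths"
    ids_long = tokenizer(long_prompt).input_ids[:tokenizer.model_max_length]  # hard slice
    ids_short = tokenizer(short_prompt).input_ids

    print(tokenizer.eos_token_id in ids_long)        # False: <|endoftext|> was sliced off
    print(ids_short[-1] == tokenizer.eos_token_id)   # True: ends in <|endoftext|>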

I reproduced this locally with the packages below on Windows + conda, and I could also reproduce it exactly on Google Colab. Perfect replication is only possible for me because I have my trained model, but if this is a bug, you could reproduce it with some real-person pictures like I did.

Logs

No response

System Info

Important package list:

diffusers==0.6.0
torch==1.12.1
torch-fidelity==0.3.0
torchaudio==0.12.1
torchmetrics==0.6.0
torchvision==0.13.1
transformers==4.18.0

python==3.8.10

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, Nov 9, 2022

1.) Very good find! The cut-off indeed does not behave as it should; we should always add an EOS to the end. Will open a PR to fix this!

2.) Interesting; not sure if this means we should change anything.
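A minimal sketch of what point 1 ("always add an EOS to the end") could look like, reusing the tokenizer from the sketches above (a paraphrase, not the actual PR):

    # Paraphrased fix sketch (not the actual PR): truncate, then restore EOS.
    ids = tokenizer(prompt).input_ids
    if len(ids) > tokenizer.model_max_length:
        ids = ids[: tokenizer.model_max_length]
        ids[-1] = tokenizer.eos_token_id  # force <|endoftext|> back into the last slot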

2 reactions
petekay commented, Nov 7, 2022

You can also reproduce this with the default pipeline, so this is a general bug in the text cut-off:

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        use_auth_token=True
    ).to("cuda")

Using prompt1 "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths,"

should give the exact same result as prompt2: "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths"

Because the "," is one token too many, the code cuts the "," AND the EOT flag, so the results are not the same/deterministic as they should be:

["token73, token74, token75"]  : OK              -> + EOT flag
["token73, token74, token75,"] : too many tokens -> without EOT flag

In summary, we found two properties:

  1. The cut-off does not behave as expected, because it cuts off the EOT flag as well.
  2. Results based on prompts shortened by the CLIP tokenizer look better (subjective opinion).

