Unexpected behavior when input ends with multiple newlines

System Info

transformers version: 4.15.0
Platform: Windows-10-10.0.19041-SP0
Python version: 3.8.5
PyTorch version (GPU?): 1.11.0+cu113 (True)
Tensorflow version (GPU?): 2.5.1 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@patrickvonplaten, @Narsil

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, …)
My own task or dataset (give details below)

Reproduction

from transformers import GPTNeoForCausalLM, GPT2Tokenizer

model_name = "EleutherAI/gpt-neo-125M"

model = GPTNeoForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, cache_dir='gpt_cache_dir', resume_download=True).half().to("cuda:0")
tokenizer = GPT2Tokenizer.from_pretrained(model_name, low_cpu_mem_usage=True, cache_dir='gpt_cache_dir', resume_download=True)

input_ids = tokenizer("This is a line 1\n\nThis is a line 2\n\nThis is a line 3\n\n", return_tensors="pt").input_ids.cuda()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.01, max_length=40, min_length=1, repetition_penalty=1.0)

gen_text = "Output: \"" + tokenizer.batch_decode(gen_tokens[:, input_ids.shape[1]:])[0] + "\""

print(gen_text)

Actual behavior:
- If the input ends with 1 newline, generating multiple tokens works as expected, but generating just 1 token yields a lone newline as the next token.
- If the input ends with 2 newlines, generating multiple tokens doesn’t work as expected, and printing the top-scoring next token reveals something unexpected, such as another newline or a token beginning with a space.

Expected behavior

Expected behavior: If prompt ends in \n\n, generated text shouldn’t start with \n.

Duplicate of https://github.com/huggingface/transformers/issues/17860 but it won’t let me re-open

Issue Analytics

State:
Created 10 months ago
Comments:5 (3 by maintainers)

Top GitHub Comments

Narsil commented, Dec 1, 2022

Isn’t this still considered a bug during tokenization? Shouldn’t the same input at each step lead to the same output?

Not really. All models usually have the basic ASCII characters in their vocabulary, so the model is free to generate t + h + e, even though the is most likely in its vocabulary as a single token. This usually doesn’t happen, since the model was usually not trained to output individual letters like this, but it’s definitely not a guarantee. Some models actually DO train on such irregular tokenizations; this is called tokenization dropout. Opinions on its benefits seem mixed (some say it’s super important, others that it negatively impacts final performance; I personally don’t have any opinion on this).
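To make the point about non-unique tokenizations concrete, here is a minimal sketch using a toy vocabulary and a greedy longest-match tokenizer. Both are assumptions made for illustration: this is not GPT-Neo's real vocabulary, and real BPE applies learned merge rules rather than longest-match. It only shows that two different token sequences can decode to the same text, and that a prompt ending in one newline can end on a different token than a prompt ending in two.

```python
# Toy vocabulary (hypothetical; not the real GPT-Neo / GPT-2 vocabulary).
VOCAB = ["the", "t", "h", "e", "\n\n", "\n", " "]


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max(
            (v for v in VOCAB if text.startswith(v, i)), key=len, default=None
        )
        if match is None:
            raise ValueError(f"no token covers {text[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens


def decode(tokens):
    """Decoding is plain concatenation, so it cannot tell tokenizations apart."""
    return "".join(tokens)


canonical = encode("the")      # what the tokenizer produces for the text
spelled_out = ["t", "h", "e"]  # what a model is still free to generate

# Different token sequences, identical decoded text.
assert canonical != spelled_out
assert decode(canonical) == decode(spelled_out) == "the"

# One trailing newline and two trailing newlines end on different tokens.
print(encode("the\n"))    # ['the', '\n']
print(encode("the\n\n"))  # ['the', '\n\n']
```

Under these assumptions, the model sees a different final token depending on how many newlines end the prompt, which is one plausible reason the continuation differs between the two cases.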

Is there a possible workaround, other than making sure certain types of inputs never get passed in?

You could do that. This is what is done under the hood for GPT-3, for instance, where “START” and “STOP” sequences are inserted for you as tokens, which avoids letting the tokenizer do it on its own. For Bloom, we had the same issue: prompts perform better when they don’t end with a trailing space (so removing trailing spaces from prompts helps the perceived quality for users entering free text). As far as I know, there is no “FIX” that solves it entirely.
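The trailing-whitespace workaround mentioned above could be sketched as follows. The helper name split_trailing_whitespace is hypothetical, and applying it to trailing newlines (rather than only trailing spaces, as described for Bloom) is an assumption; whether it improves output depends on the model and how it was trained.

```python
def split_trailing_whitespace(prompt):
    """Return (prompt without trailing whitespace, the removed suffix).

    Hypothetical helper: the suffix is returned so the caller can decide
    how to handle whitespace that the model now has to generate itself.
    """
    stripped = prompt.rstrip()  # removes trailing spaces, tabs and newlines
    return stripped, prompt[len(stripped):]


prompt = "This is a line 1\n\nThis is a line 2\n\nThis is a line 3\n\n"
clean_prompt, suffix = split_trailing_whitespace(prompt)

# The cleaned prompt is what would be passed to the tokenizer.
assert clean_prompt.endswith("This is a line 3")
assert suffix == "\n\n"
assert clean_prompt + suffix == prompt
```

With the prompt cleaned this way, the model is expected to produce the separator itself, so the generated text may begin with the removed newlines; the caller can strip or keep them as needed.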

If you could stick to using tokens, things would make more sense maybe, but it depends on the use case and how the model was trained really.

Narsil commented, Nov 29, 2022

Stop telling the model what it should do: quote.

Joking aside, how do you know what the model should do? It’s a small model, so it’s completely normal if it performs worse than expected or worse than the larger ones.