Unexpected behavior when input ends with multiple newlines
System Info
- transformers version: 4.15.0
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.5
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.5.1 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
model_name = "EleutherAI/gpt-neo-125M"
model = GPTNeoForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, cache_dir='gpt_cache_dir', resume_download=True).half().to("cuda:0")
tokenizer = GPT2Tokenizer.from_pretrained(model_name, low_cpu_mem_usage=True, cache_dir='gpt_cache_dir', resume_download=True)
input_ids = tokenizer("This is a line 1\n\nThis is a line 2\n\nThis is a line 3\n\n", return_tensors="pt").input_ids.cuda()
gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.01, max_length=40, min_length=1, repetition_penalty=1.0)
gen_text = "Output: \"" + tokenizer.batch_decode(gen_tokens[:, input_ids.shape[1]:])[0] + "\""
print(gen_text)
Actual behavior:
- If the input ends with 1 newline, generating multiple tokens works as expected, but generating just 1 token shows that the next token is a newline by itself.
- If the input ends with 2 newlines, generating multiple tokens doesn’t work as expected, and printing the top-scoring next token reveals something unexpected, such as another newline or a token beginning with a space.
Expected behavior
If the prompt ends in \n\n, the generated text shouldn’t start with \n.
Duplicate of https://github.com/huggingface/transformers/issues/17860 but it won’t let me re-open
Issue Analytics
- State:
- Created 10 months ago
- Comments:5 (3 by maintainers)
Not really. All models usually have the basic ASCII characters in their vocabulary, so the model is free to generate t + h + e, which will most likely also be in its vocabulary as the single token the. This is usually not what happens, since the model was usually not trained to output individual letters like this, but it’s definitely not a guarantee. Some models actually DO train on such irregular tokenizations; this is called tokenization dropout. The benefits in general seem mixed (some say it’s super important, some say it negatively impacts final performance; I personally don’t have an opinion on this).
You could do that. This is what is done under the hood for GPT-3, for instance, where “START” and “STOP” sequences are inserted for you as tokens, which avoids letting the tokenizer do it on its own. For Bloom, we had the same issue: prompts perform better when they don’t end with a trailing space (so removing trailing spaces from prompts helps the perceived quality for users writing free-text prompts). As far as I know, there is no fix that resolves it entirely.
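The trailing-space workaround mentioned for Bloom can be sketched as a small preprocessing step. This is only an illustration: the helper name split_trailing_whitespace is hypothetical, and whether stripping trailing newlines (rather than spaces) helps GPT-Neo is an assumption that the thread does not confirm.

```python
from typing import Tuple


def split_trailing_whitespace(prompt: str) -> Tuple[str, str]:
    """Return (prompt without trailing whitespace, the whitespace that was removed).

    Hypothetical helper: the stripped prompt is what would be passed to the
    tokenizer; the removed tail is kept so the caller can decide what to do
    with it (for example, prepend it to the generated text).
    """
    stripped = prompt.rstrip()
    return stripped, prompt[len(stripped):]


prompt = "This is a line 1\n\nThis is a line 2\n\n"
body, tail = split_trailing_whitespace(prompt)
print(repr(body))  # prompt text with the trailing newlines removed
print(repr(tail))  # the trailing newlines that were split off
```

Keeping the removed tail separate, instead of discarding it, leaves the choice of how to rejoin prompt and completion to the caller.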
If you could stick to using tokens, things would make more sense maybe, but it depends on the use case and how the model was trained really.
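To illustrate why working at the token level can behave differently from working with strings, here is a toy sketch. The vocabulary TOY_VOCAB and the greedy longest-match toy_tokenize function are hypothetical stand-ins, not the real GPT-2/GPT-Neo byte-level BPE; the only assumption carried over is that a vocabulary may contain both a single-newline token and a double-newline token.

```python
from typing import List

# Hypothetical toy vocabulary; real vocabularies are far larger.
TOY_VOCAB = ["\n\n", "\n", "line", " ", "1", "2"]


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match tokenization over TOY_VOCAB (not real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, i)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token matches at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


prompt = "line 1\n"
continuation = "\nline 2"

# Tokenizing the two pieces separately keeps two single-newline tokens...
separate = toy_tokenize(prompt) + toy_tokenize(continuation)
# ...while tokenizing the joined string merges them into one double-newline token.
joined = toy_tokenize(prompt + continuation)

print(separate)
print(joined)
```

Both token lists decode to the same string, yet they are different sequences, so a model that mostly saw one form during training may treat the other as unusual input.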
Stop telling the model what it should do: quote.
Joking aside, how do you know what the model should do? It’s a small model, so it is completely normal if it performs worse than expected or worse than the larger ones.