T5 Tokenizer Prepends Space after Each Added (Extra) Token
### System Info
$ transformers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.22.1
- Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31
- Python version: 3.10.7
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.12.0+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Who can help?

### Information

- The official example scripts
- My own modified scripts

### Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
### Reproduction

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokenizer.add_tokens(['<'])  # '>' is already in the vocab

tokenizer.decode(tokenizer('a>=5').input_ids)
# prints 'a>=5</s>' as expected (no space after >)

tokenizer.decode(tokenizer('a<=5').input_ids)
# prints 'a< =5</s>' (unexpected space after <)
```
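A plausible explanation, shown here as a pure-Python simulation rather than the actual `transformers` implementation: the slow T5 tokenizer splits the input on user-added tokens and then runs SentencePiece on each remaining piece, and SentencePiece prepends its `▁` (U+2581) word-boundary marker to every piece, which decodes back to a space. The helper names below are illustrative, not real library APIs.

```python
import re

def t5_style_tokenize(text, added_tokens):
    """Simulate a SentencePiece-based tokenizer with added tokens.

    The input is split on the added tokens; each remaining piece gets
    the '\u2581' word-boundary marker prepended, as SentencePiece does.
    """
    pattern = "(" + "|".join(map(re.escape, added_tokens)) + ")"
    pieces = []
    for part in re.split(pattern, text):
        if part in added_tokens:
            pieces.append(part)
        elif part:
            pieces.append("\u2581" + part)  # simulated SentencePiece prefix
    return pieces

def decode(pieces):
    # On decode, '\u2581' maps back to an ordinary space.
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

print(decode(t5_style_tokenize("a<=5", ["<"])))  # a< =5
```

The simulation reproduces the reported behaviour: `=5` becomes `▁=5` because it is tokenized as a fresh piece after the split on `<`, and that marker decodes to the stray space.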
### Expected behavior
There shouldn't be a space after the `<` character.
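Until this is fixed upstream, one possible workaround (a sketch, not an official API) is to post-process the decoded string and drop the single space the tokenizer inserts after each user-added token:

```python
def strip_space_after_added_tokens(text, added_tokens):
    # Remove the single space that decoding inserts immediately
    # after each user-added token.
    for tok in added_tokens:
        text = text.replace(tok + " ", tok)
    return text

print(strip_space_after_added_tokens("a< =5</s>", ["<"]))  # a<=5</s>
```

Note the caveat: this also collapses legitimate spaces after the token (e.g. `a < b` becomes `a <b`), so it is only safe when the added tokens are never followed by a real space in your data.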
### Issue Analytics

- State:
- Created: a year ago
- Comments: 14 (14 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
### Top GitHub Comments

> Awesome, closing this issue, will open a PR in tokenizers when I have the bandwidth to try to match the outputs.

> Well… this is not really intended ^^ But mostly the `fast` side is an entire library implemented in Rust, so we must have forgotten to update this argument when adding it to the `transformers` tokenizers. cc @LysandreJik and @SaulLu FYI 🤗