T5 Tokenizer Prepends Space after Each Added (Extra) Token
### System Info
$ transformers-cli env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.22.1
- Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31
- Python version: 3.10.7
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.12.0+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Who can help?

### Information

- The official example scripts
- My own modified scripts

### Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
### Reproduction

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokenizer.add_tokens(['<'])  # '>' is already in the vocab

tokenizer.decode(tokenizer('a>=5').input_ids)
# prints 'a>=5</s>' as expected (no space after >)

tokenizer.decode(tokenizer('a<=5').input_ids)
# prints 'a< =5</s>' (unexpected space after <)
```
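A plausible explanation, shown here as a pure-Python simulation rather than the actual `transformers` implementation: the slow T5 tokenizer splits the input on user-added tokens and then runs SentencePiece on each remaining piece, and SentencePiece prepends its `▁` (U+2581) word-boundary marker to every piece, which decodes back to a space. The helper names below are illustrative, not real library APIs.

```python
import re

def t5_style_tokenize(text, added_tokens):
    """Simulate a SentencePiece-based tokenizer with added tokens.

    The input is split on the added tokens; each remaining piece gets
    the '\u2581' word-boundary marker prepended, as SentencePiece does.
    """
    pattern = "(" + "|".join(map(re.escape, added_tokens)) + ")"
    pieces = []
    for part in re.split(pattern, text):
        if part in added_tokens:
            pieces.append(part)
        elif part:
            pieces.append("\u2581" + part)  # simulated SentencePiece prefix
    return pieces

def decode(pieces):
    # On decode, '\u2581' maps back to an ordinary space.
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

print(decode(t5_style_tokenize("a<=5", ["<"])))  # a< =5
```

The simulation reproduces the reported behaviour: `=5` becomes `▁=5` because it is tokenized as a fresh piece after the split on `<`, and that marker decodes to the stray space.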
### Expected behavior
There shouldn't be a space after the `<` character.
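Until this is fixed upstream, one possible workaround (a sketch, not an official API) is to post-process the decoded string and drop the single space the tokenizer inserts after each user-added token:

```python
def strip_space_after_added_tokens(text, added_tokens):
    # Remove the single space that decoding inserts immediately
    # after each user-added token.
    for tok in added_tokens:
        text = text.replace(tok + " ", tok)
    return text

print(strip_space_after_added_tokens("a< =5</s>", ["<"]))  # a<=5</s>
```

Note the caveat: this also collapses legitimate spaces after the token (e.g. `a < b` becomes `a <b`), so it is only safe when the added tokens are never followed by a real space in your data.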
### Issue Analytics

- State:
- Created: a year ago
- Comments: 14 (14 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
### Top GitHub Comments

> Awesome, closing this issue, will open a PR in tokenizers when I have the bandwidth to try to match the outputs.

> Well… this is not really intended ^^ But mostly the `fast` side is an entire library implemented in Rust, so we must have forgotten to update this argument when adding it to the `transformers` tokenizers. cc @LysandreJik and @SaulLu FYI 🤗