
T5 Tokenizer Prepends Space after Each Added (Extra) Token


System Info

$ transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.22.1
- Platform: Linux-5.15.0-46-generic-x86_64-with-glibc2.31
- Python version: 3.10.7
- Huggingface_hub version: 0.9.1
- PyTorch version (GPU?): 1.12.0+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Who can help?

@LysandreJik @SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokenizer.add_tokens(['<'])  # '>' is already in the vocab

tokenizer.decode(tokenizer('a>=5').input_ids)
# prints 'a>=5</s>' as expected (no space after '>')
tokenizer.decode(tokenizer('a<=5').input_ids)
# prints 'a< =5</s>' (unexpected space after '<')
```
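To make the failure mode concrete, here is a simplified, stdlib-only sketch (not the real transformers implementation) of the mechanism that plausibly produces the space: text containing user-added tokens is first split on those tokens, each remaining segment is tokenized independently, and SentencePiece prepends a word-boundary marker (`▁`) to every segment it tokenizes, so a segment that follows an added token decodes with a leading space. The `ADDED_TOKENS` list and both helper functions are illustrative names, not transformers APIs.

```python
# Illustrative added-token set; in the issue, '<' was added via add_tokens.
ADDED_TOKENS = ["<"]

def sketch_tokenize(text):
    """Split on added tokens, then mimic SentencePiece's metaspace marker."""
    pieces = []
    segment = ""
    i = 0
    while i < len(text):
        matched = next((t for t in ADDED_TOKENS if text.startswith(t, i)), None)
        if matched:
            if segment:
                pieces.append("▁" + segment)  # metaspace prepended per segment
                segment = ""
            pieces.append(matched)           # added token passes through verbatim
            i += len(matched)
        else:
            segment += text[i]
            i += 1
    if segment:
        pieces.append("▁" + segment)
    return pieces

def sketch_decode(pieces):
    """Turn metaspace markers back into spaces, stripping only the leading one."""
    return "".join(p.replace("▁", " ") for p in pieces).lstrip(" ")

print(sketch_tokenize("a<=5"))                 # ['▁a', '<', '▁=5']
print(sketch_decode(sketch_tokenize("a<=5")))  # 'a< =5'
print(sketch_decode(sketch_tokenize("a>=5")))  # 'a>=5' ('>' is a normal token here)
```

In this toy model the segment `=5` after `<` gets its own `▁`, which round-trips to the stray space reported in the issue, while `a>=5` stays one segment and decodes cleanly.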

Expected behavior

There shouldn't be a space after the `<` character.
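Until the behavior is fixed upstream, one pragmatic stopgap (not part of transformers; the function name and token list below are hypothetical) is to post-process the decoded string and delete the single space the tokenizer inserts right after each added token:

```python
import re

# Hypothetical workaround helper: strip the one space that appears
# immediately after each user-added token in the decoded output.
ADDED_TOKENS = ["<"]

def strip_space_after_added_tokens(decoded):
    pattern = "(" + "|".join(re.escape(t) for t in ADDED_TOKENS) + ") "
    return re.sub(pattern, r"\1", decoded)

print(strip_space_after_added_tokens("a< =5</s>"))  # 'a<=5</s>'
```

Note the caveat: this also removes legitimate spaces that happen to follow an added token (e.g. `'x < 5'` becomes `'x <5'`), so it is only safe when the input text never has a real space in that position.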

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

ArthurZucker commented on Oct 27, 2022 (1 reaction)

Awesome, closing this issue. I will open a PR in tokenizers when I have the bandwidth to try to match the outputs.

ArthurZucker commented on Oct 27, 2022 (1 reaction)

Well… this is not really intended ^^ The fast tokenizer is backed by an entire library implemented mostly in Rust, so we must have forgotten to update this argument when adding it to the transformers tokenizers. cc @LysandreJik and @SaulLu FYI 🤗


