DeBERTa-v3 does not preserve spaces before/after additional special tokens in convert_tokens_to_string output
Environment info
- transformers version: 4.12.5
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No.
- Using distributed or parallel set-up in script?: No.
Who can help
Information
Model I am using (Bert, XLNet …): microsoft/deberta-v3-small
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Initialize a DeBERTa-v3 tokenizer with additional_special_tokens.
- Tokenize some text that contains one or more of those special tokens with tokenize.
- Attempt to convert the tokens back to a string with convert_tokens_to_string.
- DeBERTa-v3 does not include a space before/after the special token in the resulting string; BERT (and earlier versions of DeBERTa) do.
from transformers import AutoTokenizer, AutoModel
special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"
# BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>
# DeBERTa (original)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>
# DeBERTa (v3)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token<SPECIAL>
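One plausible explanation (an assumption on my part, not something stated in the issue): the v3 tokenizer is SentencePiece-based and marks word-initial pieces with a "▁" prefix, while an added special token carries no such marker, which is likely why the joined string loses the surrounding space. Printing the raw token sequences makes the difference visible:

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

for name in ["bert-base-uncased", "microsoft/deberta-base", "microsoft/deberta-v3-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name, additional_special_tokens=special_tokens)
    # Inspect how each tokenizer segments the text; the SentencePiece-based v3 tokenizer
    # prefixes word-initial pieces with "▁", but the added special token has no such prefix.
    print(name, tokenizer.tokenize(text))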
Expected behavior
I expect that spaces before/after any special tokens added with additional_special_tokens will be preserved when calling tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)).
Thank you very much for your answer! Very interesting use case! In particular, why do you need to use tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)) for this use case? For DeBERTa (original and v3), I guess the tokenizer.decode(tokenizer.encode(text)) command should give the result you were expecting initially. 😊

@LysandreJik @SaulLu This still happens on the latest version of Transformers and with the latest version of DeBERTa-v3, so I am commenting to keep it open.
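For reference, a minimal sketch of the decode/encode suggestion above, staying close to the snippet from the report (whether the decoded string exactly matches the original spacing on a given release is something to verify, not something confirmed in this thread):

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)

# Round-trip through ids instead of convert_tokens_to_string(tokenize(text)).
# encode() adds the model's [CLS]/[SEP] by default; pass add_special_tokens=False
# to drop them while keeping the <SPECIAL> token that is part of the text itself.
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))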