DeBERTa-v3 does not preserve spaces before/after additional special tokens in convert_tokens_to_string output
Environment info
- transformers version: 4.12.5
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No.
- Using distributed or parallel set-up in script?: No.
Who can help
Information
Model I am using (Bert, XLNet …): microsoft/deberta-v3-small
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Initialize a DeBERTa-v3 tokenizer with additional_special_tokens.
- Tokenize some text that contains one or more of those special tokens with tokenize.
- Attempt to convert the tokens back to a string with convert_tokens_to_string.
- DeBERTa-v3 does not include a space before/after the special token in the resulting string; BERT (and earlier versions of DeBERTa) do.
from transformers import AutoTokenizer, AutoModel
special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"
# BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>
# DeBERTa (original)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>
# DeBERTa (v3)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token<SPECIAL>
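One plausible explanation (an assumption on my part, not something stated in the issue): the v3 tokenizer is SentencePiece-based and marks word-initial pieces with a "▁" prefix, while an added special token carries no such marker, which is likely why the joined string loses the surrounding space. Printing the raw token sequences makes the difference visible:

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

for name in ["bert-base-uncased", "microsoft/deberta-base", "microsoft/deberta-v3-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name, additional_special_tokens=special_tokens)
    # Inspect how each tokenizer segments the text; the SentencePiece-based v3 tokenizer
    # prefixes word-initial pieces with "▁", but the added special token has no such prefix.
    print(name, tokenizer.tokenize(text))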
Expected behavior
I expect that spaces before/after any special tokens added with additional_special_tokens will be preserved when calling tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)).
Thank you very much for your answer! Very interesting use case! In particular, why do you need to use tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)) for this use case? For DeBERTa (original and v3), I guess the tokenizer.decode(tokenizer.encode(text)) command should give the result you were expecting initially. 😊

@LysandreJik @SaulLu This still happens on the latest version of Transformers and with the latest version of DeBERTa-v3, so I am commenting to keep it open.
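For reference, a minimal sketch of the decode/encode suggestion above, staying close to the snippet from the report (whether the decoded string exactly matches the original spacing on a given release is something to verify, not something confirmed in this thread):

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)

# Round-trip through ids instead of convert_tokens_to_string(tokenize(text)).
# encode() adds the model's [CLS]/[SEP] by default; pass add_special_tokens=False
# to drop them while keeping the <SPECIAL> token that is part of the text itself.
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))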