
DeBERTa-v3 does not preserve spaces before/after additional special tokens in convert_tokens_to_string output


Environment info

  • transformers version: 4.12.5
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No.
  • Using distributed or parallel set-up in script?: No.

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet …): microsoft/deberta-v3-small

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Initialize a DeBERTa-v3 tokenizer with additional_special_tokens.
  2. Use tokenize on some text that contains one or more of those special tokens.
  3. Attempt to convert the tokens back to a string with convert_tokens_to_string.
  4. DeBERTa-v3 does not include a space before/after the special token in the resulting string. BERT (and earlier versions of DeBERTa) do.

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

# BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>

# DeBERTa (original)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>

# DeBERTa (v3)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token<SPECIAL>

Expected behavior

I expect that spaces before/after any special tokens added with additional_special_tokens will be preserved when calling tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)).
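
As a concrete check, this expectation can be written as a round-trip assertion (a sketch only, reusing the tokenizer and text from the reproduction above); per the outputs shown there, it currently fails for microsoft/deberta-v3-small but holds for bert-base-uncased and microsoft/deberta-base:

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-small", additional_special_tokens=special_tokens
)

# The token-level round trip should reproduce the input, including the
# space before the added special token.
roundtrip = tokenizer.convert_tokens_to_string(tokenizer.tokenize(text))
assert roundtrip == text, f"space lost around special token: {roundtrip!r}"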

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
SaulLu commented, Jan 18, 2022

Thank you very much for your answer! Very interesting use case!

And in particular, why do you need to use tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)) for this use case?

For DeBERTa (original and v3), I guess the tokenizer.decode(tokenizer.encode(text)) command should give the result you were expecting initially. 😊
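
For example, something along these lines (an untested sketch, reusing the special token and example text from the reproduction above; add_special_tokens=False just keeps the model's own [CLS]/[SEP] out of the decoded string):

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-small", additional_special_tokens=special_tokens
)

# Round-trip through token ids instead of token strings.
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))
# Expected (per the suggestion above): the space before <SPECIAL> is preserved.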

1 reaction
JohnGiorgi commented, Jan 16, 2022

@LysandreJik @SaulLu This still happens on the latest version of Transformers and with the latest version of DeBERTa-v3, so I am commenting to keep it open.


Top Results From Across the Web

BPEDecoder no spaces after special tokens - Intermediate
I have a custom BPE tokenizer, with a BPEDecoder (to fix additional spaces in the decoded output) but my decoded outputs have no...

Conversion of space characters into space tokens - TeX
Usually TeX does process input line by line: The whole line is read and the whole line is pre-processed. One step of pre-processing...

Spacy tokenizer with only "Whitespace" rule - Stack Overflow
Let's change nlp.tokenizer with a custom Tokenizer with token_match regex: import re import spacy from spacy.tokenizer import Tokenizer nlp ...

How to Train BPE, WordPiece, and Unigram Tokenizers from ...
Before we get to the fun part of training and comparing the ... These are tokens for unknown words and other special tokens...

DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training ...
where c is the index set of the masked tokens in the sequence. The authors of BERT propose to keep. 10% of the...
