odd whitespace handling with imported sentencepiece models
Environment info
- transformers version: 4.7.0
- Platform: Linux-4.15.0-143-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
@LysandreJik Although my example uses ReformerTokenizer, I think this problem is present in several of the model architectures using sentencepiece tokenizers.
Information
Model I am using (Bert, XLNet …): Reformer
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
#!/usr/bin/env python3
import sentencepiece as spm
import transformers as tr

src = (
    'Lorem Ipsum dolor sit amet, consectetur adipiscing elit, sed do',
    'eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut',
    'enim ad minim veniam, quis nostrud exercitation ullamco laboris',
    'nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
    'reprehenderit in voluptate velit esse cillum dolore eu fugiat',
    'nulla pariatur. Excepteur sint occaecat cupidatat non proident,',
    'sunt in culpa qui officia deserunt mollit anim id est laborum.',
)

spm.SentencePieceTrainer.train(
    sentence_iterator=iter(src),
    model_prefix='test',
    vocab_size=96,
    treat_whitespace_as_suffix=True,
    user_defined_symbols=['<pad>', '<mask>'],
    minloglevel=1,
)

def show(label, toks):
    print('%14s %2d: %s' % (label, len(toks), toks))

text = 'Lo<mask>m Ipsum'

tok = spm.SentencePieceProcessor(model_file='test.model')
show('sentencepiece', tok.encode(text, out_type=str))

tok = tr.models.reformer.ReformerTokenizerFast('test.model',
                                               mask_token='<mask>',
                                               pad_token='<pad>')
show('transformers', tok.tokenize(text))

tok.save_pretrained('test')
tr.models.reformer.ReformerConfig().save_pretrained('test')
tok = tr.AutoTokenizer.from_pretrained('test')
show('AutoTokenizer', tok.tokenize(text))
Running this gives:
sentencepiece 9: ['L', 'o', '<mask>', 'm▁', 'I', 'p', 's', 'um', '▁']
transformers 10: ['▁', 'L', 'o', '<mask>', 'm', '▁', 'I', 'p', 's', 'um']
AutoTokenizer 11: ['▁', 'L', 'o', '<mask>', '▁', 'm', '▁', 'I', 'p', 's', 'um']
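To make the inserted whitespace visible, the token sequences can be decoded back to text and compared against the input (a quick sketch, not part of the original report; it reuses text from the script above and assumes the test.model file and test directory that the script created):

sp = spm.SentencePieceProcessor(model_file='test.model')
fast = tr.models.reformer.ReformerTokenizerFast('test.model',
                                                mask_token='<mask>',
                                                pad_token='<pad>')
auto = tr.AutoTokenizer.from_pretrained('test')

# decode each tokenizer's own token sequence back into a string
print(repr(sp.decode(sp.encode(text, out_type=str))))
print(repr(fast.convert_tokens_to_string(fast.tokenize(text))))
print(repr(auto.convert_tokens_to_string(auto.tokenize(text))))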
Expected behavior
I believe the tokenization of the input text should be consistent across these code paths. I suspect these variations are creeping in between pretraining a language model and later fine-tuning the saved model, and are the cause of the accuracy problems I am seeing.
Using treat_whitespace_as_suffix=True in sentencepiece makes this problem worse, but even with a sentencepiece model trained without this flag, the tokenizer created by AutoTokenizer.from_pretrained() still inserts whitespace that was not present in the source text. I haven't been able to track down where this comes from or how to avoid it.
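For reference, a minimal way to check the non-suffix case is to retrain the sentencepiece model without treat_whitespace_as_suffix=True and repeat the comparison (a sketch; the test2 names are made up, and it reuses src, text, and show() from the script above):

# retrain without treat_whitespace_as_suffix=True, using a separate model prefix
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(src),
    model_prefix='test2',
    vocab_size=96,
    user_defined_symbols=['<pad>', '<mask>'],
    minloglevel=1,
)

sp = spm.SentencePieceProcessor(model_file='test2.model')
show('sentencepiece', sp.encode(text, out_type=str))

tok2 = tr.models.reformer.ReformerTokenizerFast('test2.model',
                                                mask_token='<mask>',
                                                pad_token='<pad>')
show('transformers', tok2.tokenize(text))

tok2.save_pretrained('test2')
tr.models.reformer.ReformerConfig().save_pretrained('test2')
show('AutoTokenizer', tr.AutoTokenizer.from_pretrained('test2').tokenize(text))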
Top GitHub Comments
spm.SentencePieceTrainer and ReformerTokenizerFast are not the same tokenizers, so it is not unusual that they produce different results; however, I'm not sure exactly how the two tokenizers differ, for lack of knowledge on my part.
Regarding the difference between ReformerTokenizerFast and AutoTokenizer, I discovered something. One of the easiest ways to make the two tokenizers produce the same output is to remove mask_token='<mask>' and to delete the test directory holding the previously saved config files (if a test folder exists). Another way is to remove special_tokens_map.json and tokenizer_config.json (after save_pretrained); these files are unnecessary when using the fast tokenizer. I don't know the cause of this problem, but I suspect there is a conflict between the configurations of the fast tokenizer and the slow tokenizer.
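In code form, the second workaround would look roughly like this (a sketch; it assumes the test directory written by save_pretrained('test') in the reproduction script, and reuses tr, text, and show() from it):

import os

# delete the two config files that save_pretrained() wrote alongside the fast tokenizer
for name in ('special_tokens_map.json', 'tokenizer_config.json'):
    path = os.path.join('test', name)
    if os.path.exists(path):
        os.remove(path)

# re-load; per the comment above, the output should now match ReformerTokenizerFast
tok = tr.AutoTokenizer.from_pretrained('test')
show('AutoTokenizer', tok.tokenize(text))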
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.