odd whitespace handling with imported sentencepiece models
Environment info
- transformers version: 4.7.0
- Platform: Linux-4.15.0-143-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.9.0+cu102 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
@LysandreJik Although my example uses ReformerTokenizer, I think this problem is present in several of the model architectures using sentencepiece tokenizers.
Information
Model I am using (Bert, XLNet …): Reformer
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
#!/usr/bin/env python3
import sentencepiece as spm
import transformers as tr

src = (
    'Lorem Ipsum dolor sit amet, consectetur adipiscing elit, sed do',
    'eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut',
    'enim ad minim veniam, quis nostrud exercitation ullamco laboris',
    'nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in',
    'reprehenderit in voluptate velit esse cillum dolore eu fugiat',
    'nulla pariatur. Excepteur sint occaecat cupidatat non proident,',
    'sunt in culpa qui officia deserunt mollit anim id est laborum.',
)

spm.SentencePieceTrainer.train(
    sentence_iterator=iter(src),
    model_prefix='test',
    vocab_size=96,
    treat_whitespace_as_suffix=True,
    user_defined_symbols=['<pad>', '<mask>'],
    minloglevel=1,
)

def show(label, toks):
    print('%14s %2d: %s' % (label, len(toks), toks))

text = 'Lo<mask>m Ipsum'

tok = spm.SentencePieceProcessor(model_file='test.model')
show('sentencepiece', tok.encode(text, out_type=str))

tok = tr.models.reformer.ReformerTokenizerFast('test.model',
                                               mask_token='<mask>',
                                               pad_token='<pad>')
show('transformers', tok.tokenize(text))

tok.save_pretrained('test')
tr.models.reformer.ReformerConfig().save_pretrained('test')
tok = tr.AutoTokenizer.from_pretrained('test')
show('AutoTokenizer', tok.tokenize(text))
Running this gives:
sentencepiece 9: ['L', 'o', '<mask>', 'm▁', 'I', 'p', 's', 'um', '▁']
transformers 10: ['▁', 'L', 'o', '<mask>', 'm', '▁', 'I', 'p', 's', 'um']
AutoTokenizer 11: ['▁', 'L', 'o', '<mask>', '▁', 'm', '▁', 'I', 'p', 's', 'um']
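To make the inserted whitespace visible, the token sequences can be decoded back to text and compared against the input (a quick sketch, not part of the original report; it reuses text from the script above and assumes the test.model file and test directory that the script created):

sp = spm.SentencePieceProcessor(model_file='test.model')
fast = tr.models.reformer.ReformerTokenizerFast('test.model',
                                                mask_token='<mask>',
                                                pad_token='<pad>')
auto = tr.AutoTokenizer.from_pretrained('test')

# decode each tokenizer's own token sequence back into a string
print(repr(sp.decode(sp.encode(text, out_type=str))))
print(repr(fast.convert_tokens_to_string(fast.tokenize(text))))
print(repr(auto.convert_tokens_to_string(auto.tokenize(text))))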
Expected behavior
I believe the tokenization of the input text should be consistent across these code paths. I suspect these variations are creeping in between pretraining a language model and later fine-tuning the saved model, and are the cause of the accuracy problems I am seeing.
Using treat_whitespace_as_suffix=True in sentencepiece makes this problem worse, but even with a sentencepiece model trained without this flag, the tokenizer created by AutoTokenizer.from_pretrained() still inserts whitespace that was not present in the source text. I haven't been able to track down where this comes from or how to avoid it.
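For reference, a minimal way to check the non-suffix case is to retrain the sentencepiece model without treat_whitespace_as_suffix=True and repeat the comparison (a sketch; the test2 names are made up, and it reuses src, text, and show() from the script above):

# retrain without treat_whitespace_as_suffix=True, using a separate model prefix
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(src),
    model_prefix='test2',
    vocab_size=96,
    user_defined_symbols=['<pad>', '<mask>'],
    minloglevel=1,
)

sp = spm.SentencePieceProcessor(model_file='test2.model')
show('sentencepiece', sp.encode(text, out_type=str))

tok2 = tr.models.reformer.ReformerTokenizerFast('test2.model',
                                                mask_token='<mask>',
                                                pad_token='<pad>')
show('transformers', tok2.tokenize(text))

tok2.save_pretrained('test2')
tr.models.reformer.ReformerConfig().save_pretrained('test2')
show('AutoTokenizer', tr.AutoTokenizer.from_pretrained('test2').tokenize(text))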
Top GitHub Comments
spm.SentencePieceTrainer and ReformerTokenizerFast are not the same tokenizers, so it is not unusual that they produce different results; however, I'm not sure exactly how the two tokenizers differ, for lack of knowledge on my part.
Regarding the difference between ReformerTokenizerFast and AutoTokenizer, I discovered something. One of the easiest ways to make the two tokenizers produce the same output is to remove mask_token='<mask>' and to delete the test directory holding the previously saved config files (if a test folder exists). Another way is to remove special_tokens_map.json and tokenizer_config.json (after save_pretrained); these files are unnecessary when using the fast tokenizer. I don't know the cause of this problem, but I suspect there is a conflict between the configurations of the fast tokenizer and the slow tokenizer.
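In code form, the second workaround would look roughly like this (a sketch; it assumes the test directory written by save_pretrained('test') in the reproduction script, and reuses tr, text, and show() from it):

import os

# delete the two config files that save_pretrained() wrote alongside the fast tokenizer
for name in ('special_tokens_map.json', 'tokenizer_config.json'):
    path = os.path.join('test', name)
    if os.path.exists(path):
        os.remove(path)

# re-load; per the comment above, the output should now match ReformerTokenizerFast
tok = tr.AutoTokenizer.from_pretrained('test')
show('AutoTokenizer', tok.tokenize(text))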
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.