Special tokens not tokenized properly
Environment info
- transformers version: 4.5.1
- Python version: 3.8.5
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Information
Hi,
I have recently further pre-trained a RoBERTa model with fairseq, using a custom vocabulary trained with the tokenizers module. After converting the fairseq model to PyTorch, I uploaded all my model-related files here.
When loading the tokenizer, I noticed that the special tokens are not tokenized properly.
To reproduce
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
tokenizer.tokenize('<mask>')
Out[7]: ['<mask>']
tokenizer.tokenize('<hashtag>')
Out[8]: ['hashtag']
tokenizer.encode('<hashtag>')
Out[3]: [0, 23958, 2]
Expected behavior
Since <hashtag> is a special token in the vocabulary with ID 7 (see here), the last output should be [0, 7, 2]. <hashtag> with the '<>' should also be recognized as a single token.
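A common workaround for this symptom (a sketch, not confirmed as the resolution of this particular issue) is to explicitly re-register the token as a special token after loading; since the string already exists in the vocabulary, no new ID is allocated:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('manueltonneau/twibert-lowercase-50272')
# Mark <hashtag> as an additional special token so it is kept atomic
# instead of being split around the angle brackets.
tokenizer.add_special_tokens({'additional_special_tokens': ['<hashtag>']})
tokenizer.tokenize('<hashtag>')  # expected: ['<hashtag>']
tokenizer.encode('<hashtag>')    # expected: [0, 7, 2]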
Potential explanation
Looking at the files of a similar model, it seems that its vocab is in txt format and it also ships a bpe.codes file, which I don't have. Could that be the issue? And if so, how do I convert my files to this format?
For vocab.txt, I have already found your lengthy explanation here, thanks for this.
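For context, and as my own reading rather than anything confirmed in the thread: the vocab.txt + bpe.codes pair comes from fastBPE, which some fairseq-derived models use, whereas the RobertaTokenizer in transformers is built from a vocab.json + merges.txt pair, exactly the files the tokenizers library produces, so no conversion to bpe.codes should be needed. A minimal loading sketch with hypothetical file paths:

from transformers import RobertaTokenizer
# vocab.json + merges.txt (byte-level BPE) are all RobertaTokenizer needs;
# no bpe.codes file is involved. The paths below are placeholders.
tokenizer = RobertaTokenizer(vocab_file='vocab.json', merges_file='merges.txt')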
Top GitHub Comments
I created a new vocab with the tokenizers module, to which I added new special tokens. Here is the code I used:
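The snippet itself did not survive in this copy of the thread; a minimal sketch of training such a vocab with tokenizers (the corpus path, vocab size, and token list below are assumptions) could look like:

from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=True)
# Special tokens receive the first IDs in the order given; 'tweets.txt'
# and the extra <hashtag> token are hypothetical.
tokenizer.train(
    files=['tweets.txt'],
    vocab_size=50272,
    special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>', '<hashtag>'],
)
# Writes vocab.json and merges.txt, loadable with RobertaTokenizer.
tokenizer.save_model('twibert-lowercase-50272')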
Works fine, thanks again!