Loading a DeBERTa tokenizer trained with the Tokenizers library
See original GitHub issue

Environment info
- transformers version: 4.4.dev0
- Platform: Ubuntu 18
- Python version: 3.7
- PyTorch version (GPU?): 1.7.1 (YES)
- Tensorflow version (GPU?):
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: NO
Who can help
@LysandreJik @patrickvonplaten @patil-suraj @sgugger @n1t0
Information
Model I am using (Bert, XLNet …): DeBERTa
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Use, for example, the OSCAR corpus in Spanish, and train a BPE tokenizer (the kind DeBERTa needs) with the Tokenizers library.
- Try to load DebertaTokenizer from the .json file generated by Tokenizers.
The code used for training the tokenizer was the following:

import glob
import os
import random

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

if __name__ == "__main__":
    # Plain BPE model, trained from scratch.
    tokenizer = Tokenizer(BPE())
    # tokenizer = ByteLevelBPETokenizer(add_prefix_space=False)

    trainer = BpeTrainer(
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        vocab_size=50265,
        continuing_subword_prefix="\u0120",
        min_frequency=2,
    )
    # t = AutoTokenizer.from_pretrained("microsoft/deberta-base")

    # Train on a random sample of the cleaned corpus files.
    files = glob.glob("cleaned_train_data/*.csv")
    files_sample = random.choices(files, k=250)
    tokenizer.train(files=files_sample, trainer=trainer)

    # Tokenizer.save() writes a single JSON file, so save it inside the
    # output directory rather than passing the directory path itself.
    os.makedirs("bpe_tokenizer_0903", exist_ok=True)
    tokenizer.save("bpe_tokenizer_0903/tokenizer.json")
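For reference, the loading attempt from the second step looks roughly like this (a minimal sketch; the directory name comes from the training script above, and the exact error depends on the transformers version):

from transformers import DebertaTokenizer

# The slow DebertaTokenizer looks for its own vocabulary files in this
# directory, not for the tokenizer.json written by the script above,
# so this call fails.
tokenizer = DebertaTokenizer.from_pretrained("bpe_tokenizer_0903")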
The problem is that the DebertaTokenizer from transformers needs a different set of files from the ones Tokenizers generates. It is ironic, since both are Hugging Face libraries, yet there does not seem to be much integration between the two.

Given that, I tried several things. First, I added added_tokens.json, special_tokens_map.json, vocab.json, vocab.txt and merges.txt; all of that information is contained in tokenizer.json (the file generated by huggingface/tokenizers). None of it worked. Then I looked at the files that are saved when you load a DebertaTokenizer from the Microsoft checkpoints, so that I could copy the structure of the saved folder. Replicating bpe_encoder.bin was the difficult part: I used my merges for bpe_encoder["vocab"] (the "vocab" entry in the Microsoft bpe_encoder.bin appears to contain merges) and put the vocabulary dict in bpe_encoder["encoder"]. I could not reproduce bpe_encoder["dict_map"], because Tokenizers does not save token frequencies, so I filled it with random numbers. However, when I try to train with this tokenizer, it throws a KeyError at step 5 on the token 'Ŀ', which is strange, because DebertaTokenizer.from_pretrained(my_path)("Ŀ") does tokenize that exact token.
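For context, the bpe_encoder.bin described above was assembled roughly as follows (a sketch under the assumptions stated in the paragraph: the key names are copied from the Microsoft checkpoint, the file is assumed to be a torch pickle, and the dict_map frequencies are invented because Tokenizers does not export them):

import json
import random

import torch

# Read the vocab and merges produced by the Tokenizers training run.
with open("bpe_tokenizer_0903/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token -> id
merges = tok["model"]["merges"]  # list of BPE merge rules

bpe_encoder = {
    "vocab": merges,    # the Microsoft "vocab" field appears to hold merges
    "encoder": vocab,   # token -> id mapping
    # Tokenizers does not save token frequencies, so these are made up.
    "dict_map": {token: random.randint(1, 1000) for token in vocab},
}
torch.save(bpe_encoder, "bpe_tokenizer_0903/bpe_encoder.bin")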
I think all of these problems stem from the disconnect between the Transformers and Tokenizers libraries: tokenizers trained with Tokenizers cannot be used directly in Transformers, which makes little sense to me, since Tokenizers is supposed to be the library for training the tokenizers that are later used in Transformers.
Could anyone please tell me how I can train a DeBERTa tokenizer that is saved, from the start, with the files needed by Transformers' DebertaTokenizer? Is there any version of Tokenizers that, when you train a BPE tokenizer, saves the files required by Transformers?
Thank you very much.
Expected behavior
If two libraries come from the same company and the mission of one is to build tools used by the other, they should expect and produce the same objects for the same tasks; it does not make sense that you can train a BPE tokenizer that you cannot then use as a tokenizer in Transformers. So, if DebertaTokenizer uses a BPE tokenizer and expects to receive bpe_encoder.bin, special_tokens_map.json and tokenizer_config.json, then training a BPE tokenizer with the Tokenizers library should save those files, not a tokenizer.json file that cannot be used later with the Transformers library.
Issue Analytics
- Created 3 years ago
- Comments: 5 (1 by maintainers)
Top GitHub Comments
The tokenizers library powers the fast tokenizers, not the Python "slow" tokenizers. As there is no fast tokenizer for DeBERTa, you can't use the tokenizers library for that model.
You can check which tokenizers have a version backed by the Tokenizers library in this table.
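As an aside, for models that do have a fast tokenizer, the tokenizer.json produced by Tokenizers can be wrapped directly with the generic fast tokenizer class. A minimal sketch (this does not produce the DeBERTa-specific slow-tokenizer files; it only illustrates the intended Tokenizers-to-Transformers bridge, and the paths and special tokens are taken from the script above):

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers file in the generic fast tokenizer class and
# re-save it in the layout that transformers expects.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="bpe_tokenizer_0903/tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("bpe_tokenizer_0903_fast")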
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.