Infernal loading of a trained tokenizer

See original GitHub issue

Environment info

  • transformers version: 4.4.dev0
  • Platform: Ubuntu 18
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7.1 (YES)
  • Tensorflow version (GPU?):
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: NO

Who can help

@LysandreJik @patrickvonplaten @patil-suraj @sgugger @n1t0

Information

Model I am using (Bert, XLNet …): DeBERTa

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use, for example, the OSCAR corpus in Spanish, then use the Tokenizers library to train a BPE tokenizer (the kind DeBERTa needs).
  2. Try to load a DebertaTokenizer from the .json file generated by Tokenizers (see the sketch after the training code below).

The code used for training the tokenizer was the following:

import glob
import os
import random

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

if __name__ == "__main__":
    tokenizer = Tokenizer(BPE())
    # tokenizer = ByteLevelBPETokenizer(add_prefix_space=False)

    trainer = BpeTrainer(
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        vocab_size=50265,
        continuing_subword_prefix="\u0120",
        min_frequency=2,
    )
    # t = AutoTokenizer.from_pretrained("microsoft/deberta-base")
    files = glob.glob("cleaned_train_data/*.csv")
    # random.choices samples with replacement, so some files may repeat
    files_sample = random.choices(files, k=250)
    tokenizer.train(
        files=files_sample,
        trainer=trainer,
    )
    os.makedirs("bpe_tokenizer_0903", exist_ok=True)
    # Tokenizer.save expects a file path, not a directory; it writes a single
    # tokenizer.json file
    tokenizer.save(os.path.join("bpe_tokenizer_0903", "tokenizer.json"))
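
Step 2 then amounts to something like the following sketch (the directory name matches the training script above):

from transformers import DebertaTokenizer

# This fails: the (slow) DebertaTokenizer does not read the tokenizer.json
# file written above; it looks for its own vocabulary files instead.
tokenizer = DebertaTokenizer.from_pretrained("bpe_tokenizer_0903")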

The problem is that the DebertaTokenizer in Transformers expects a different set of files from the ones Tokenizers generates. It is ironic, given that both are Hugging Face libraries, that there seems to be so little integration between the two.

Given that, I tried several things. First, I tried adding added_tokens.json, special_tokens_map.json, vocab.json, vocab.txt, merges.txt, and so on; all of that information is contained in tokenizer.json (the file generated by huggingface/tokenizers). None of it worked.

Then I looked at the files that are saved when you load a DebertaTokenizer from the Microsoft checkpoints, so that I could copy the structure of the saved folder. For bpe_encoder.bin there were some difficulties. I put my merges into bpe_encoder["vocab"], since the "vocab" entry in Microsoft's bpe_encoder.bin seemed to contain merges, and I put the vocabulary dict into bpe_encoder["encoder"]. I could not replicate bpe_encoder["dict_map"], because token frequencies are not saved by Tokenizers, so I invented them with random numbers. However, when I try to train with this tokenizer it throws a KeyError at step 5 for the token 'Ŀ', which is strange, because tokenizing that exact token directly, with DebertaTokenizer.from_pretrained(my_path)("Ŀ"), does work.
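
For reference, the reconstruction attempt looks roughly like the sketch below. The key names ("vocab", "encoder", "dict_map") follow the layout I observed in Microsoft's bpe_encoder.bin as described above, the vocab and merges are read back from the tokenizer.json produced by the training script, the frequencies are the invented random placeholders mentioned above, and the output path is illustrative:

import json
import random

import torch

# Read the vocab and merges back out of the tokenizer.json produced by the
# tokenizers library. In the tokenizers version used here, "merges" is a list
# of "tokenA tokenB" strings and "vocab" maps token -> id.
with open("bpe_tokenizer_0903/tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)
vocab = tok["model"]["vocab"]
merges = tok["model"]["merges"]

# Assumed layout of bpe_encoder.bin, based on inspecting the file saved from
# the microsoft/deberta-base checkpoint: "vocab" appears to hold the merges,
# "encoder" the token -> id map, and "dict_map" token frequencies, which
# tokenizers does not record, hence the random placeholders.
bpe_encoder = {
    "vocab": merges,
    "encoder": vocab,
    "dict_map": {token: random.randint(1, 1000) for token in vocab},
}
torch.save(bpe_encoder, "bpe_tokenizer_0903/bpe_encoder.bin")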

I think these problems are mainly caused by a complete disconnect between the Transformers and Tokenizers libraries: tokenizers trained with Tokenizers cannot be loaded directly in Transformers, which doesn't make much sense to me, because Tokenizers is supposed to be the library used to train the tokenizers that are later used in Transformers.

Could anyone please tell me how I can train a DeBERTa tokenizer that is saved, from the start, with the files needed by the Transformers DebertaTokenizer? Is there any version of Tokenizers in which training a BPE tokenizer saves the files Transformers requires?

Thank you very much.

Expected behavior

It is expected that if two libraries come from the same company, and the mission of one is to build tools used by the other, then the two expect and produce the same objects for the same tasks; it makes no sense to train a BPE tokenizer that you cannot later use as a tokenizer in Transformers. Concretely: if DebertaTokenizer uses a BPE tokenizer and expects to receive bpe_encoder.bin, special_tokens_map.json and tokenizer_config.json, then training a BPE tokenizer with the Tokenizers library should save those files, not a tokenizer.json file that cannot be used later with the Transformers library.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Mar 11, 2021

The tokenizers library powers the fast tokenizers, not the Python “slow” tokenizers. As there is no fast tokenizer for deberta, you can’t use the tokenizers library for that model.

You can check which tokenizers have a version backed by the Tokenizers library in this table.
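
For models that do have a fast tokenizer, a tokenizer.json trained with the tokenizers library can be wrapped directly in the generic fast tokenizer class and then saved in the layout Transformers expects. A minimal sketch, assuming the file and special tokens from the training script above:

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers output in a generic fast tokenizer. The special
# tokens must be declared explicitly; they are not inferred from the file.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="bpe_tokenizer_0903/tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    mask_token="[MASK]",
)

# save_pretrained writes tokenizer.json, tokenizer_config.json and
# special_tokens_map.json in a layout that from_pretrained can reload.
fast_tokenizer.save_pretrained("bpe_tokenizer_0903")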

0 reactions
github-actions[bot] commented, Apr 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Top Results From Across the Web

Training a new tokenizer from an old one - Hugging Face
Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and...

Inference API: Can't load tokenizer using from_pretrained ...
I uploaded the tokenizer files to colab, and I was able to instantiate a tokenizer with the from_pretrained method, so I don't know...

How to use [HuggingFace's] Transformers Pre-Trained ...
For complete instruction, you can visit the installation section in the document. After that, we need to load the pre-trained tokenizer. By the...

Loading a HuggingFace model into AllenNLP gives different ...
trainer.save_model(model_name) tokenizer.save_pretrained(model_name). I'm trying to load such persisted model using the allennlp library for...

Get your own tokenizer with Transformers & Tokenizers
Lucile teaches us how to build and train a custom tokenizer and how to use in Transformers. Lucile is a machine learning engineer at...
