
Fast tokenizer converter leads to PanicException: no entry found for key


I am working on adding PLBart’s tokenizer. The tokenizer uses sentencepiece.bpe.model and is similar to MBart’s, so to convert it to a fast tokenizer I used the same converter, MBartConverter, and modified it. The definition is as follows and can also be found here:

# Imports needed to run this snippet standalone; inside convert_slow_tokenizer.py
# both names are already in scope.
from tokenizers import processors
from transformers.convert_slow_tokenizer import SpmConverter


class PLBartConverter(SpmConverter):
    def vocab(self, proto):
        # Fairseq-style specials first, then the SentencePiece pieces,
        # then the PLBart language codes and the mask token.
        vocab = [
            ("<s>", 0.0),
            ("<pad>", 0.0),
            ("</s>", 0.0),
            ("<unk>", 0.0),
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        vocab += [("java", 0.0), ("python", 0.0), ("en_XX", 0.0)]
        vocab += [("<mask>", 0.0)]
        return vocab

    def unk_id(self, proto):
        # <unk> sits at index 3 in the vocab built above
        return 3

    def post_processor(self):
        # MBart-style suffix: append </s> and the language code to every sequence
        return processors.TemplateProcessing(
            single="$A </s> en_XX",
            pair="$A $B </s> en_XX",
            special_tokens=[
                ("en_XX", self.original_tokenizer.convert_tokens_to_ids("en_XX")),
                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
            ],
        )
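
For the conversion script below to pick this converter up, it also has to be registered by class name in the SLOW_TO_FAST_CONVERTERS mapping in convert_slow_tokenizer.py. A minimal sketch of that registration (the exact edit lives in the branch linked above):

# Sketch: register the converter so the conversion script can look it up
# under the slow tokenizer's class name (a dict entry in the real file).
SLOW_TO_FAST_CONVERTERS["PLBartTokenizer"] = PLBartConverter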

However, running the conversion method

from transformers.convert_slow_tokenizers_checkpoints_to_fast import convert_slow_checkpoint_to_fast
convert_slow_checkpoint_to_fast('PLBartTokenizer','plbart-base', 'plbart-base', False)

leads to the following error:

Assigning ['java', 'python', 'en_XX'] to the additional_special_tokens key of the tokenizer
Save fast tokenizer to plbart-base with prefix plbart-base add_prefix True
=> plbart-base with prefix plbart-base, add_prefix True
tokenizer config file saved in plbart-base/plbart-base-tokenizer_config.json
Special tokens file saved in plbart-base/plbart-base-special_tokens_map.json
thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 87, in convert_slow_checkpoint_to_fast
    file_names = tokenizer.save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 2044, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 579, in _save_pretrained
    self.backend_tokenizer.save(tokenizer_file)
pyo3_runtime.PanicException: no entry found for key
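
One way to narrow the panic down (a debugging sketch, not from the original issue) is to convert the slow tokenizer in-process and check whether the tokens referenced by the post_processor actually resolve to ids in the fast backend before saving. This assumes the in-progress PLBartTokenizer class and the local plbart-base checkpoint used above:

from transformers import PLBartTokenizer  # the in-progress slow tokenizer class
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = PLBartTokenizer.from_pretrained("plbart-base")
backend = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer
for token in ("</s>", "en_XX", "java", "python", "<mask>"):
    # None here would point at a token that never made it into the fast vocabulary
    print(token, backend.token_to_id(token))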

Possible Fixes?

  • https://github.com/huggingface/tokenizers/issues/776 suggests removing special_tokens from the trainer. I assumed that is analogous to removing special_tokens from the post_processor, so I tried it (roughly as sketched after the traceback below), and it leads to the following error:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 60, in convert_slow_checkpoint_to_fast
        tokenizer = tokenizer_class.from_pretrained(checkpoint, force_download=force_download)
      File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
        return cls._from_pretrained(
      File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/crocoder/Desktop/transformers/src/transformers/models/plbart/tokenization_plbart_fast.py", line 138, in __init__
        super().__init__(
      File "/home/crocoder/Desktop/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
        super().__init__(
      File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 111, in __init__
        fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
      File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 1056, in convert_slow_tokenizer
        return converter_class(transformer_tokenizer).converted()
      File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 488, in converted
        post_processor = self.post_processor()
      File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 913, in post_processor
        return processors.TemplateProcessing(
    ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`
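
For reference, that attempt amounts to something like the following (a rough reconstruction, not the exact code): the templates still reference </s> and en_XX, so TemplateProcessing has no ids for them and raises the ValueError shown in the traceback.

from tokenizers import processors

# Dropping special_tokens while keeping the placeholders in the templates
post_processor = processors.TemplateProcessing(
    single="$A </s> en_XX",
    pair="$A $B </s> en_XX",
    special_tokens=[],
)
# -> ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`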
    


Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

1 reaction
patrickvonplaten commented, Dec 10, 2021

Taking care of it on Monday next week

1 reaction
patrickvonplaten commented, Nov 8, 2021

Given that mbart has a very specific tokenization, we might have to add a new tokenization_plbart.py file in my opinion. Or do you think the MBart tokenizer is 1-to-1 correct for PLBart’s tokenization?

