Fast tokenizer converter leads to PanicException: no entry found for key
I am working on adding PLBart's tokenizer. The tokenizer uses `sentencepiece.bpe.model` and is similar to MBart's. Hence, to convert to a fast tokenizer, I used the same converter, `MBartConverter`, and modified it. The definition is as follows and can also be found here:
```python
# PLBartConverter sits alongside the other converters in
# transformers/convert_slow_tokenizer.py; the imports below are only needed
# when defining it outside that module.
from tokenizers import processors
from transformers.convert_slow_tokenizer import SpmConverter


class PLBartConverter(SpmConverter):
    def vocab(self, proto):
        vocab = [
            ("<s>", 0.0),
            ("<pad>", 0.0),
            ("</s>", 0.0),
            ("<unk>", 0.0),
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        vocab += [("java", 0.0), ("python", 0.0), ("en_XX", 0.0)]
        vocab += [("<mask>", 0.0)]
        return vocab

    def unk_id(self, proto):
        return 3

    def post_processor(self):
        return processors.TemplateProcessing(
            single="$A </s> en_XX",
            pair="$A $B </s> en_XX",
            special_tokens=[
                ("en_XX", self.original_tokenizer.convert_tokens_to_ids("en_XX")),
                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
            ],
        )
```
However, running the conversion method
```python
from transformers.convert_slow_tokenizers_checkpoints_to_fast import convert_slow_checkpoint_to_fast

convert_slow_checkpoint_to_fast("PLBartTokenizer", "plbart-base", "plbart-base", False)
```
leads to the following error:
```
Assigning ['java', 'python', 'en_XX'] to the additional_special_tokens key of the tokenizer
Save fast tokenizer to plbart-base with prefix plbart-base add_prefix True
=> plbart-base with prefix plbart-base, add_prefix True
tokenizer config file saved in plbart-base/plbart-base-tokenizer_config.json
Special tokens file saved in plbart-base/plbart-base-special_tokens_map.json
thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 87, in convert_slow_checkpoint_to_fast
    file_names = tokenizer.save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 2044, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 579, in _save_pretrained
    self.backend_tokenizer.save(tokenizer_file)
pyo3_runtime.PanicException: no entry found for key
```
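From what I can tell (this is my reading of the `tokenizers` source, not something stated above), the panic in `models/mod.rs` comes from serializing the model's vocab as a contiguous id → token map, so it fires as soon as some id in `0..vocab_size` has no token attached. A quick way to look for such holes before calling `save()` is to convert the slow tokenizer in isolation and scan the resulting vocab. In the sketch below, the import path of the work-in-progress `PLBartTokenizer`, the registration of `PLBartConverter` in `SLOW_TO_FAST_CONVERTERS`, and the local `plbart-base` checkpoint directory are all assumptions:

```python
# Diagnostic sketch, not part of the original report.
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers.models.plbart.tokenization_plbart import PLBartTokenizer  # hypothetical path

slow = PLBartTokenizer.from_pretrained("plbart-base")
fast = convert_slow_tokenizer(slow)  # a tokenizers.Tokenizer, not yet saved

# The Rust save path writes the vocab as a contiguous id -> token map, so any
# id in 0..len(vocab) without a token makes it panic. Look for such holes.
vocab = fast.get_vocab(with_added_tokens=True)
present = set(vocab.values())
holes = [i for i in range(len(vocab)) if i not in present]
print("vocab size:", len(vocab), "missing ids:", holes[:20])

# Duplicate entries in the list returned by PLBartConverter.vocab() are one way
# such holes appear: building {token: index} from it silently drops an index.
```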
Possible Fixes?
- https://github.com/huggingface/tokenizers/issues/776 - Suggests removing `special_tokens` from the trainer. I assumed that is analogous to removing `special_tokens` from the `post_processor`? I tried it and it leads to the following error (a short illustration of why follows this list):

  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 60, in convert_slow_checkpoint_to_fast
      tokenizer = tokenizer_class.from_pretrained(checkpoint, force_download=force_download)
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
      return cls._from_pretrained(
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
      tokenizer = cls(*init_inputs, **init_kwargs)
    File "/home/crocoder/Desktop/transformers/src/transformers/models/plbart/tokenization_plbart_fast.py", line 138, in __init__
      super().__init__(
    File "/home/crocoder/Desktop/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
      super().__init__(
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 111, in __init__
      fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 1056, in convert_slow_tokenizer
      return converter_class(transformer_tokenizer).converted()
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 488, in converted
      post_processor = self.post_processor()
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 913, in post_processor
      return processors.TemplateProcessing(
  ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`
  ```
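That `ValueError` is expected: `TemplateProcessing` requires every special token referenced in the template to also be declared in `special_tokens` with an id, so the declarations cannot simply be dropped. A minimal, self-contained illustration (the ids below are placeholders, not PLBart's real ids):

```python
from tokenizers import processors

# Works: every special token used in the template is declared with an id.
ok = processors.TemplateProcessing(
    single="$A </s> en_XX",
    pair="$A $B </s> en_XX",
    special_tokens=[("</s>", 2), ("en_XX", 50003)],  # placeholder ids
)

# Fails with the same error as above:
# ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`
try:
    processors.TemplateProcessing(single="$A </s> en_XX", special_tokens=[])
except ValueError as err:
    print(err)
```

So if the original panic is really about ids, the fix presumably has to make the ids passed to `special_tokens` line up with the converted vocab, rather than removing the entries.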
Related issues
- https://github.com/huggingface/tokenizers/issues/611 - While this error can be found in #13443, I am unable to understand how to fix an existing `sentencepiece.bpe.model` file to remove non-consecutive tokens, if that is the case (a quick inspection sketch follows this list).
- https://github.com/huggingface/tokenizers/issues/260 - Similar suggestion, but it is unclear what to do in the case of an spm file.
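Before editing the spm file itself, it may be worth checking whether any of the pieces the converter appends already exist inside `sentencepiece.bpe.model`, since duplicates in the converter's vocab list also end up producing non-consecutive ids. A sketch; the file path is an assumption, and the protobuf bindings used are the copy bundled with transformers for its converters:

```python
# Inspect sentencepiece.bpe.model directly. The path below is an assumption.
from transformers.utils import sentencepiece_model_pb2 as model_pb2

proto = model_pb2.ModelProto()
with open("plbart-base/sentencepiece.bpe.model", "rb") as f:
    proto.ParseFromString(f.read())

existing = {p.piece for p in proto.pieces}
for tok in ["<s>", "<pad>", "</s>", "<unk>", "java", "python", "en_XX", "<mask>"]:
    print(tok, "already a piece:", tok in existing)
print("model_type:", proto.trainer_spec.model_type, "(1 = unigram, 2 = bpe)")
print("total pieces:", len(proto.pieces))
```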
Taking care of it on Monday next week
Given that `mbart` has a very specific tokenization, we might have to add a new `tokenization_plbart.py` file in my opinion. Or do you think the MBart tokenizer is 1-to-1 correct for PLBart's tokenization?
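One way to answer that empirically is to load both the plain slow `MBartTokenizer` and the candidate PLBart tokenizer from the same `sentencepiece.bpe.model` and diff their outputs on a few code snippets. A rough sketch; the `PLBartTokenizer` import path, the spm file path, and the sample strings are all assumptions:

```python
from transformers import MBartTokenizer
from transformers.models.plbart.tokenization_plbart import PLBartTokenizer  # hypothetical path

SPM = "plbart-base/sentencepiece.bpe.model"  # assumed local path
mbart = MBartTokenizer(vocab_file=SPM)
plbart = PLBartTokenizer(vocab_file=SPM)

samples = [
    "def add(a, b): return a + b",
    "public static void main(String[] args) {}",
]
for text in samples:
    a, b = mbart.tokenize(text), plbart.tokenize(text)
    print("match" if a == b else "differ", "->", text)
    if a != b:
        print("  mbart :", a)
        print("  plbart:", b)
```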