Fast tokenizer converter leads to PanicException: no entry found for key
I am working on adding PLBart's tokenizer. The tokenizer uses `sentencepiece.bpe.model` and is similar to MBart's. Hence, to convert to a fast tokenizer, I used the same converter, `MBartConverter`, and modified it. The definition is as follows and can also be found here:
```python
# PLBartConverter sits alongside the other converters in
# transformers/convert_slow_tokenizer.py; the imports below are only needed
# when defining it outside that module.
from tokenizers import processors
from transformers.convert_slow_tokenizer import SpmConverter


class PLBartConverter(SpmConverter):
    def vocab(self, proto):
        vocab = [
            ("<s>", 0.0),
            ("<pad>", 0.0),
            ("</s>", 0.0),
            ("<unk>", 0.0),
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        vocab += [("java", 0.0), ("python", 0.0), ("en_XX", 0.0)]
        vocab += [("<mask>", 0.0)]
        return vocab

    def unk_id(self, proto):
        return 3

    def post_processor(self):
        return processors.TemplateProcessing(
            single="$A </s> en_XX",
            pair="$A $B </s> en_XX",
            special_tokens=[
                ("en_XX", self.original_tokenizer.convert_tokens_to_ids("en_XX")),
                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
            ],
        )
```
However, running the conversion method
```python
from transformers.convert_slow_tokenizers_checkpoints_to_fast import convert_slow_checkpoint_to_fast

convert_slow_checkpoint_to_fast("PLBartTokenizer", "plbart-base", "plbart-base", False)
```
leads to the following error:
```
Assigning ['java', 'python', 'en_XX'] to the additional_special_tokens key of the tokenizer
Save fast tokenizer to plbart-base with prefix plbart-base add_prefix True
=> plbart-base with prefix plbart-base, add_prefix True
tokenizer config file saved in plbart-base/plbart-base-tokenizer_config.json
Special tokens file saved in plbart-base/plbart-base-special_tokens_map.json
thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 87, in convert_slow_checkpoint_to_fast
    file_names = tokenizer.save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 2044, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 579, in _save_pretrained
    self.backend_tokenizer.save(tokenizer_file)
pyo3_runtime.PanicException: no entry found for key
```
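From what I can tell (this is my reading of the `tokenizers` source, not something stated above), the panic in `models/mod.rs` comes from serializing the model's vocab as a contiguous id → token map, so it fires as soon as some id in `0..vocab_size` has no token attached. A quick way to look for such holes before calling `save()` is to convert the slow tokenizer in isolation and scan the resulting vocab. In the sketch below, the import path of the work-in-progress `PLBartTokenizer`, the registration of `PLBartConverter` in `SLOW_TO_FAST_CONVERTERS`, and the local `plbart-base` checkpoint directory are all assumptions:

```python
# Diagnostic sketch, not part of the original report.
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers.models.plbart.tokenization_plbart import PLBartTokenizer  # hypothetical path

slow = PLBartTokenizer.from_pretrained("plbart-base")
fast = convert_slow_tokenizer(slow)  # a tokenizers.Tokenizer, not yet saved

# The Rust save path writes the vocab as a contiguous id -> token map, so any
# id in 0..len(vocab) without a token makes it panic. Look for such holes.
vocab = fast.get_vocab(with_added_tokens=True)
present = set(vocab.values())
holes = [i for i in range(len(vocab)) if i not in present]
print("vocab size:", len(vocab), "missing ids:", holes[:20])

# Duplicate entries in the list returned by PLBartConverter.vocab() are one way
# such holes appear: building {token: index} from it silently drops an index.
```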
Possible Fixes?
- https://github.com/huggingface/tokenizers/issues/776 - Suggests removing `special_tokens` from the trainer. I assumed that is analogous to removing `special_tokens` from the `post_processor`? I tried it and it leads to the following error (a short illustration of why follows this list):

  ```
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizers_checkpoints_to_fast.py", line 60, in convert_slow_checkpoint_to_fast
      tokenizer = tokenizer_class.from_pretrained(checkpoint, force_download=force_download)
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
      return cls._from_pretrained(
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
      tokenizer = cls(*init_inputs, **init_kwargs)
    File "/home/crocoder/Desktop/transformers/src/transformers/models/plbart/tokenization_plbart_fast.py", line 138, in __init__
      super().__init__(
    File "/home/crocoder/Desktop/transformers/src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 134, in __init__
      super().__init__(
    File "/home/crocoder/Desktop/transformers/src/transformers/tokenization_utils_fast.py", line 111, in __init__
      fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 1056, in convert_slow_tokenizer
      return converter_class(transformer_tokenizer).converted()
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 488, in converted
      post_processor = self.post_processor()
    File "/home/crocoder/Desktop/transformers/src/transformers/convert_slow_tokenizer.py", line 913, in post_processor
      return processors.TemplateProcessing(
  ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`
  ```
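That `ValueError` is expected: `TemplateProcessing` requires every special token referenced in the template to also be declared in `special_tokens` with an id, so the declarations cannot simply be dropped. A minimal, self-contained illustration (the ids below are placeholders, not PLBart's real ids):

```python
from tokenizers import processors

# Works: every special token used in the template is declared with an id.
ok = processors.TemplateProcessing(
    single="$A </s> en_XX",
    pair="$A $B </s> en_XX",
    special_tokens=[("</s>", 2), ("en_XX", 50003)],  # placeholder ids
)

# Fails with the same error as above:
# ValueError: Missing SpecialToken(s) with id(s) `</s>, en_XX`
try:
    processors.TemplateProcessing(single="$A </s> en_XX", special_tokens=[])
except ValueError as err:
    print(err)
```

So if the original panic is really about ids, the fix presumably has to make the ids passed to `special_tokens` line up with the converted vocab, rather than removing the entries.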
Related issues
- https://github.com/huggingface/tokenizers/issues/611 - While this error can be found in #13443, I am unable to understand how to fix an existing `sentencepiece.bpe.model` file to remove non-consecutive tokens, if that is the case (a quick inspection sketch follows this list).
- https://github.com/huggingface/tokenizers/issues/260 - Similar suggestion, but it is unclear what to do in the case of an spm file.
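Before editing the spm file itself, it may be worth checking whether any of the pieces the converter appends already exist inside `sentencepiece.bpe.model`, since duplicates in the converter's vocab list also end up producing non-consecutive ids. A sketch; the file path is an assumption, and the protobuf bindings used are the copy bundled with transformers for its converters:

```python
# Inspect sentencepiece.bpe.model directly. The path below is an assumption.
from transformers.utils import sentencepiece_model_pb2 as model_pb2

proto = model_pb2.ModelProto()
with open("plbart-base/sentencepiece.bpe.model", "rb") as f:
    proto.ParseFromString(f.read())

existing = {p.piece for p in proto.pieces}
for tok in ["<s>", "<pad>", "</s>", "<unk>", "java", "python", "en_XX", "<mask>"]:
    print(tok, "already a piece:", tok in existing)
print("model_type:", proto.trainer_spec.model_type, "(1 = unigram, 2 = bpe)")
print("total pieces:", len(proto.pieces))
```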
Taking care of it on Monday next week
Given that `mbart` has a very specific tokenization, we might have to add a new `tokenization_plbart.py` file in my opinion. Or do you think the MBart tokenizer is 1-to-1 correct for PLBart's tokenization?
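One way to answer that empirically is to load both the plain slow `MBartTokenizer` and the candidate PLBart tokenizer from the same `sentencepiece.bpe.model` and diff their outputs on a few code snippets. A rough sketch; the `PLBartTokenizer` import path, the spm file path, and the sample strings are all assumptions:

```python
from transformers import MBartTokenizer
from transformers.models.plbart.tokenization_plbart import PLBartTokenizer  # hypothetical path

SPM = "plbart-base/sentencepiece.bpe.model"  # assumed local path
mbart = MBartTokenizer(vocab_file=SPM)
plbart = PLBartTokenizer(vocab_file=SPM)

samples = [
    "def add(a, b): return a + b",
    "public static void main(String[] args) {}",
]
for text in samples:
    a, b = mbart.tokenize(text), plbart.tokenize(text)
    print("match" if a == b else "differ", "->", text)
    if a != b:
        print("  mbart :", a)
        print("  plbart:", b)
```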