Adding tokens to `RobertaTokenizer` is fast, but loading the extended tokenizer from disk takes tens of minutes
System Info
- `transformers` version: 4.18.0
- Platform: Linux-5.10.0-0.bpo.9-amd64-x86_64-with-debian-10.12
- Python version: 3.7.3
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.11.0+cu102 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: false
- Using distributed or parallel set-up in script?: false
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I train a BPE tokenizer on a domain-specific dataset and save it as `tokenizer-latex.json`.
>>> from tokenizers import Tokenizer, normalizers, pre_tokenizers
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>>
>>> latex_model = BPE(unk_token='[UNK]')
>>> latex_tokenizer = Tokenizer(latex_model)
>>> latex_tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
>>> latex_tokenizer.normalizer = normalizers.Sequence([normalizers.Strip()])
>>> latex_tokenizer_trainer = BpeTrainer(special_tokens=['[UNK]'])
>>> latex_tokenizer.train(['dataset-latex.txt'], latex_tokenizer_trainer)
>>> latex_tokenizer.save('tokenizer-latex.json')
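As a quick sanity check of the trained tokenizer (not part of the original report; the input string below is just a made-up example, and the exact segmentation depends on the training data):
>>> output = latex_tokenizer.encode(r'\frac { a } { b }')
>>> output.tokens  # how the trained BPE model segments the input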
Then, I extend the pre-trained `roberta-base` tokenizer with 28,141 new tokens from the vocabulary of my BPE tokenizer and save the result to the directory `./extended-roberta-base/`. This finishes in a matter of seconds:
>>> from tokenizers import Tokenizer
>>> from transformers import RobertaTokenizer
>>>
>>> latex_tokenizer = Tokenizer.from_file('tokenizer-latex.json')
>>>
>>> text_latex_tokenizer = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
>>> text_latex_tokenizer.add_tokens(list(latex_tokenizer.get_vocab()))
28141
>>> text_latex_tokenizer.save_pretrained('./extended-roberta-base/')
('./extended-roberta-base/tokenizer_config.json', './extended-roberta-base/special_tokens_map.json',
'./extended-roberta-base/vocab.json', './extended-roberta-base/merges.txt',
'./extended-roberta-base/added_tokens.json', './extended-roberta-base/tokenizer.json')
However, when I load the extended `roberta-base` tokenizer from the directory `./extended-roberta-base/`, the library constructs a trie (see https://github.com/huggingface/transformers/pull/13220) over the course of ca. 20 minutes:
>>> from transformers import RobertaTokenizer
>>>
>>> text_latex_tokenizer = RobertaTokenizer.from_pretrained('./extended-roberta-base/')
^C
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
    text_latex_tokenizer = RobertaTokenizer.from_pretrained('./extended-roberta-base/')
  File "***/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1787, in from_pretrained
    **kwargs,
  File "***/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1971, in _from_pretrained
    tokenizer.add_tokens(token, special_tokens=bool(token in special_tokens))
  File "***/python3.7/site-packages/transformers/tokenization_utils_base.py", line 945, in add_tokens
    return self._add_tokens(new_tokens, special_tokens=special_tokens)
  File "***/python3.7/site-packages/transformers/tokenization_utils.py", line 444, in _add_tokens
    self._create_trie(self.unique_no_split_tokens)
  File "***/python3.7/site-packages/transformers/tokenization_utils.py", line 454, in _create_trie
    trie.add(token)
  File "***/python3.7/site-packages/transformers/tokenization_utils.py", line 87, in add
    ref = ref[char]
KeyboardInterrupt
The time disparity leads me to believe that when `RobertaTokenizer.add_tokens()` is called, a trie is either not created or is created extremely fast, whereas when `RobertaTokenizer.from_pretrained()` is called, a trie is created (slowly). Using `RobertaTokenizerFast` instead of `RobertaTokenizer` produces similar results on a similar timescale.
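To make the disparity concrete, a rough timing sketch (not part of the original report) could look as follows, assuming `tokenizer-latex.json` from above is available; note that the second measurement exercises the slow path, so it may run for roughly 20 minutes:
import time
from tokenizers import Tokenizer
from transformers import RobertaTokenizer

tokens = list(Tokenizer.from_file('tokenizer-latex.json').get_vocab())

# One batched call: the trie over the no-split tokens is rebuilt once.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
start = time.perf_counter()
tokenizer.add_tokens(tokens)
print('batched add_tokens():', time.perf_counter() - start)

# One call per token (roughly what _from_pretrained() does for the added tokens,
# cf. the traceback above): the trie is rebuilt after every single addition.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
start = time.perf_counter()
for token in tokens:
    tokenizer.add_tokens(token)
print('per-token add_tokens():', time.perf_counter() - start)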
Expected behavior
Both `add_tokens()` and `from_pretrained()` should take a comparable amount of time. Either building the trie is important and cannot be sped up, in which case `add_tokens()` should also take roughly 20 minutes, or building the trie is unimportant or can be sped up, in which case `from_pretrained()` should finish in a matter of seconds.
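Until this is addressed, a possible workaround (a sketch, assuming `tokenizer-latex.json` is kept around) is to re-create the extended tokenizer from the `roberta-base` checkpoint with a single batched `add_tokens()` call instead of calling `from_pretrained()` on the extended directory:
from tokenizers import Tokenizer
from transformers import RobertaTokenizer

latex_tokenizer = Tokenizer.from_file('tokenizer-latex.json')
text_latex_tokenizer = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
text_latex_tokenizer.add_tokens(list(latex_tokenizer.get_vocab()))  # single trie rebuild, finishes in seconds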
Issue Analytics
- Created a year ago
- Comments: 14 (14 by maintainers)
We did it in `tokenizers` since the `Trie` insertion order of added tokens should not be important (this is also currently the case in slow tokenizers): https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/tokenizer/serialization.rs#L172
There might be other things to deal with in the python code, but the `Trie` itself doesn't care about insertion order, so we can create it only once.

Thanks a lot for working on this fix.
On my side, I'm planning to look at your PR tomorrow. As this is a change that will impact all tokenizers, it requires a very attentive review on our part, which is why it can take a little while.
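For illustration, a minimal standalone sketch (a toy dict-based trie, not the actual transformers/tokenizers implementation) of the point that the resulting trie does not depend on insertion order, and therefore only needs to be built once after all tokens are known:
import random

def build_trie(tokens):
    # Toy character trie: nested dicts, with '' marking the end of a token.
    root = {}
    for token in tokens:
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node[''] = 1
    return root

tokens = ['\\frac', '\\sqrt', '\\alpha', '\\alphabet']
shuffled = list(tokens)
random.shuffle(shuffled)

# The trie is identical regardless of insertion order, so building it once at the end is safe.
assert build_trie(tokens) == build_trie(shuffled)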