fast tokenizer issue on most user uploaded models
Environment info
- transformers version: 3.4.0
- Platform: Linux-5.8.0-25-generic-x86_64-with-glibc2.32
- Python version: 3.8.6
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help
Information
Found the bug on camembert/camembert-base-ccnet, but it is probably common to many models uploaded by users. On the camembert base model it works out of the box (there is no bug).
To reproduce
Since tokenizers 0.9, it is possible to load many unigram-based tokenizers with the fast Rust implementation. It appears that the tokenizer_config.json file of some of them is not up to date; in particular, the entry "model_max_length": 512 is missing. Because of that, the value of model_max_length falls back to a very large integer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-base-ccnet", use_fast=True)
tokenizer.model_max_length
# Out[4]: 1000000000000000019884624838656
To fix it, the field model_max_length has to be added to the tokenizer_config.json file.
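Until the config files on the Hub are updated, the limit can also be overridden locally. A minimal sketch, assuming the 512-token limit of the underlying CamemBERT checkpoint; from_pretrained forwards extra keyword arguments to the tokenizer constructor, so the second form should work as well:

from transformers import AutoTokenizer

# Option 1: load the tokenizer, then set the attribute directly.
tokenizer = AutoTokenizer.from_pretrained("camembert/camembert-base-ccnet", use_fast=True)
tokenizer.model_max_length = 512  # assumed limit, taken from the base camembert model

# Option 2: pass the limit at load time (forwarded to the tokenizer's __init__).
tokenizer = AutoTokenizer.from_pretrained(
    "camembert/camembert-base-ccnet", use_fast=True, model_max_length=512
)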
Expected behavior
I would expect tokenizer.model_max_length to be equal to 512.
Yes, we need to remove all the hardcoded configuration values of tokenizers in the transformers source code, and upload tokenizer_config.json files for all those models. Also cc @n1t0
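For reference, a minimal sketch of what the per-model fix could look like; the 512 value and the output directory name are assumptions for illustration:

from transformers import AutoTokenizer

# Re-save the tokenizer with an explicit limit, so the generated
# tokenizer_config.json should contain "model_max_length": 512.
tokenizer = AutoTokenizer.from_pretrained(
    "camembert/camembert-base-ccnet", use_fast=True, model_max_length=512
)
tokenizer.save_pretrained("camembert-base-ccnet-fixed")
# The files written to that directory can then be uploaded to the Model Hub.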
@pommedeterresautee Hi, I am not sure it is a fast tokenizers bug, but maybe more a property that was (maybe unintentionally) dropped from Tokenizers. Can you tell us what the actual bug is for you in the end? Just to make sure the fix I am working on will actually work as generally as possible.
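One way to make that comparison concrete is to load the same checkpoint with both implementations and print what each reports; a minimal sketch, with no claim about which values actually come back:

from transformers import AutoTokenizer

# Compare the slow (Python) and fast (Rust) loaders on the same checkpoint.
for use_fast in (False, True):
    tok = AutoTokenizer.from_pretrained("camembert/camembert-base-ccnet", use_fast=use_fast)
    print(type(tok).__name__, tok.model_max_length)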