Tokenizers bug: version 2.10 doesn't honor `max_len` when instantiating a pretrained model
🐛 Bug
Information
Hello! I've just upgraded from Transformers 2.8 to Transformers 2.10 and noticed that the `max_len` parameter is no longer honored when instantiating a pretrained tokenizer. For example, in Transformers 2.8.0 I was able to limit the length of a tokenized sequence as follows:
```python
>>> import transformers
>>> tok = transformers.RobertaTokenizer.from_pretrained('roberta-base', max_len=16)
>>> tok.encode('This is a sentence', pad_to_max_length=True)
[0, 152, 16, 10, 3645, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> print(tok.max_len)
16
```
However, on version 2.10, `max_len` is ignored when loading a pretrained tokenizer:
```python
>>> import transformers
>>> tok = transformers.RobertaTokenizer.from_pretrained('roberta-base', max_len=16)
>>> tok.encode('This is a sentence', pad_to_max_length=True)
[0, 152, 16, 10, 3645, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]  # 512 tokens
>>> print(tok.max_len)
512
```
This bug can be temporarily worked around by passing `model_max_length` instead of `max_len`, but the change broke all of my scripts that relied on the `max_len` attribute. It seems that this issue was introduced by a recent change in tokenization_utils.py (line 825):
```python
# For backward compatibility we fallback to set model_max_length from max_len if provided
model_max_length = model_max_length if model_max_length is not None else kwargs.pop("max_len", None)
```
This fallback does not apply when the pretrained tokenizer configuration already provides `model_max_length`: in that case a `max_len` passed to `from_pretrained` is silently ignored.
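In the meantime, this is a minimal sketch of the workaround mentioned above (passing `model_max_length` at load time; argument and attribute names as they exist in transformers 2.10):

```python
import transformers

# Temporary workaround on 2.10: pass model_max_length instead of max_len.
tok = transformers.RobertaTokenizer.from_pretrained('roberta-base', model_max_length=16)

print(tok.model_max_length)  # 16, i.e. the user-supplied limit is honored
print(len(tok.encode('This is a sentence', pad_to_max_length=True)))  # 16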
Model I am using (Bert, XLNet …): As far as I can tell, this affects all pretrained models. Observed on BERT, RoBERTa, and DistilBERT.
Language I am using the model on (English, Chinese …): As far as I can tell, this is independent of the language. Observed on English.
The problem arises when using:
- the official example scripts
- my own modified scripts: See above.
The task I am working on is:
- an official GLUE/SQuAD task:
- my own task or dataset: It's a classification task.
To reproduce
See the snippets above; a condensed, self-contained version follows.
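A self-contained repro, assuming transformers 2.10.0 is installed (on 2.8.0 the assertions pass; on 2.10.0 the first one already fails because `max_len` is ignored and 512 is used):

```python
import transformers

tok = transformers.RobertaTokenizer.from_pretrained('roberta-base', max_len=16)
ids = tok.encode('This is a sentence', pad_to_max_length=True)

assert tok.max_len == 16, tok.max_len   # 512 on 2.10.0
assert len(ids) == 16, len(ids)         # 512 on 2.10.0
```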
Expected behavior
Passing `max_len=16` to `from_pretrained` should be honored as in 2.8.0: `tok.max_len` should be 16 and the encoded sequence should be padded to 16 tokens.
Environment info
- transformers version: 2.10.0
- Platform: Linux-4.15.0-1060-aws-x86_64-with-debian-buster-sid
- Python version: 3.6.5
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes; 4 x Tesla V100
- Using distributed or parallel set-up in script?: parallel, but not relevant
Top GitHub Comments
@soldni I've fixed the case where both are provided in the same PR; it will be included in the next patch release.
Thanks for reporting! I'm closing; feel free to reopen if needed 👍
Morgan
Awesome! In the meantime, I’ve updated my code as you recommended.
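For reference, this is roughly what the update looks like on my end (a minimal sketch; the helper is my own code, not a library API):

```python
# Read the length limit under whichever attribute name the installed
# transformers version exposes: model_max_length (2.9+) or max_len (older).
def tokenizer_limit(tokenizer):
    limit = getattr(tokenizer, "model_max_length", None)
    if limit is None:
        limit = getattr(tokenizer, "max_len", None)
    return limit
```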
Thanks again for the super quick response on this.
-Luca