Cannot load saved tokenizer using AutoTokenizer
Environment info
- transformers version: 3.4.0
- Platform: Win10 x64 (1607 Build 14393.3866)
- Python version: 3.6.10
- PyTorch version (GPU?): 1.5.1
- Tensorflow version (GPU?): None
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
It appears that you can save a tokenizer to disk in a model-agnostic way, but you cannot load it back in a model-agnostic way. Is this a bug or by design?
To reproduce
Steps to reproduce the behavior:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
tokenizer.save_pretrained('TEST/tokenizer')
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
# ERROR
```
The error occurs because the config argument is None, so AutoTokenizer falls back to AutoConfig.from_pretrained, which looks for file_utils.CONFIG_NAME (config.json). However, tokenizer.save_pretrained writes tokenization_utils_base.TOKENIZER_CONFIG_FILE (tokenizer_config.json) instead, so the two are not compatible with one another.
Expected behavior
I would assume that calling AutoTokenizer.from_pretrained would be able to load and instantiate the correct model tokenizer without the user having to import the model-specific tokenizer class directly (e.g. RobertaTokenizer.from_pretrained). This would help a lot in moving to a model-agnostic way of handling tokenizers, which I feel is the goal of the AutoTokenizer class. The fact that it can't load a tokenizer from disk seems to be a bug, unless there is a different way of doing this?
Hello! Indeed, I wouldn’t say this is a bug but more of a limitation of the AutoTokenizer class, which has to rely on the model configuration in order to guess which tokenizer is affiliated with the model. Since you’re not interacting with the configuration anywhere here, and therefore are not saving the model configuration in TEST/tokenizer, the AutoTokenizer cannot guess which tokenizer class to load.

One way to work around this limitation is to specify the configuration when loading the tokenizer the second time:
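The exact snippet from the reply isn't preserved here; a minimal sketch of that approach, assuming the config kwarg of AutoTokenizer.from_pretrained is used to supply the roberta-base configuration, could look like:

```python
from transformers import AutoConfig, AutoTokenizer

# Load the model configuration separately, since TEST/tokenizer has no config.json
config = AutoConfig.from_pretrained('roberta-base')

# Pass the configuration explicitly so AutoTokenizer can resolve the tokenizer class
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer', config=config)
```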
Another way would be to save the configuration in the initial folder:
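Again as a sketch (assuming AutoConfig is used to fetch and save the matching configuration alongside the tokenizer files):

```python
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
config = AutoConfig.from_pretrained('roberta-base')

# Save the tokenizer files and config.json into the same folder
tokenizer.save_pretrained('TEST/tokenizer')
config.save_pretrained('TEST/tokenizer')

# With config.json present, AutoTokenizer can infer the tokenizer class on its own
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
```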
In any case, the documentation about this should be improved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.