Cannot load saved tokenizer using AutoTokenizer
Environment info
- transformers version: 3.4.0
- Platform: Win10 x64 (1607 Build 14393.3866)
- Python version: 3.6.10
- PyTorch version (GPU?): 1.5.1
- Tensorflow version (GPU?): None
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
It appears that you can save a tokenizer to disk in a model-agnostic way, but you cannot load it back in a model-agnostic way. Is this a bug or by design?
To reproduce
Steps to reproduce the behavior:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
tokenizer.save_pretrained('TEST/tokenizer')
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
# ERROR
```
The error occurs because the config argument is None, so AutoTokenizer falls back to AutoConfig.from_pretrained, which looks for file_utils.CONFIG_NAME (config.json). However, tokenizer.save_pretrained writes tokenization_utils_base.TOKENIZER_CONFIG_FILE (tokenizer_config.json) instead, so the two are not compatible with one another.
Expected behavior
I would assume that calling AutoTokenizer.from_pretrained would be able to load and instantiate the correct model tokenizer without the user having to import the model-specific tokenizer class directly (e.g. RobertaTokenizer.from_pretrained). This would help a lot in moving to a model-agnostic way of handling tokenizers, which I feel is the goal of the AutoTokenizer class. The fact that it can't load a tokenizer from disk seems to be a bug, unless there is a different way of doing this?
Hello! Indeed, I wouldn’t say this is a bug but more of a limitation of the AutoTokenizer class, which has to rely on the model configuration in order to guess which tokenizer is affiliated with the model. Since you’re not interacting with the configuration anywhere here, and therefore are not saving the model configuration in TEST/tokenizer, the AutoTokenizer cannot guess which tokenizer class to load.

One way to work around this limitation is to specify the configuration when loading the tokenizer the second time:
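The exact snippet from the reply isn't preserved here; a minimal sketch of that approach, assuming the config kwarg of AutoTokenizer.from_pretrained is used to supply the roberta-base configuration, could look like:

```python
from transformers import AutoConfig, AutoTokenizer

# Load the model configuration separately, since TEST/tokenizer has no config.json
config = AutoConfig.from_pretrained('roberta-base')

# Pass the configuration explicitly so AutoTokenizer can resolve the tokenizer class
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer', config=config)
```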
Another way would be to save the configuration in the initial folder:
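Again as a sketch (assuming AutoConfig is used to fetch and save the matching configuration alongside the tokenizer files):

```python
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
config = AutoConfig.from_pretrained('roberta-base')

# Save the tokenizer files and config.json into the same folder
tokenizer.save_pretrained('TEST/tokenizer')
config.save_pretrained('TEST/tokenizer')

# With config.json present, AutoTokenizer can infer the tokenizer class on its own
tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
```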
In any case, the documentation about this should be improved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.