
Cannot load saved tokenizer using AutoTokenizer

See original GitHub issue

Environment info

  • transformers version: 3.4.0
  • Platform: Win10 x64 (1607 Build 14393.3866)
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.5.1
  • Tensorflow version (GPU?): None
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@mfuntowicz

Information

It appears that you can save a tokenizer to disk in a model-agnostic way, but you cannot load it back in a model-agnostic way. Is this a bug or by design?

To reproduce

Steps to reproduce the behavior:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
tokenizer.save_pretrained('TEST/tokenizer')

tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
# ERROR

The error occurs because the config argument is None, which means AutoTokenizer calls AutoConfig.from_pretrained, and that utilises file_utils.CONFIG_NAME; tokenizer.save_pretrained, however, writes tokenization_utils_base.TOKENIZER_CONFIG_FILE instead, so the two are not compatible with one another.
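For illustration, a minimal sketch of the mismatch; the two constants below exist in the transformers 3.x series, and the commented values are what that series ships (worth double-checking against your installed version):

from transformers.file_utils import CONFIG_NAME
from transformers.tokenization_utils_base import TOKENIZER_CONFIG_FILE

# AutoConfig.from_pretrained looks for this file in the directory...
print(CONFIG_NAME)            # config.json
# ...while tokenizer.save_pretrained writes only this one:
print(TOKENIZER_CONFIG_FILE)  # tokenizer_config.json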

Expected behavior

I would expect calling AutoTokenizer.from_pretrained to load and instantiate the correct model tokenizer without the user having to import the model-specific tokenizer class first (e.g. RobertaTokenizer.from_pretrained). This would help a lot in moving to a model-agnostic way of handling tokenizers, which I feel is the goal of the AutoTokenizer class. The fact that it can’t load a tokenizer from disk seems to be a bug, unless there is a different way of doing this?
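For reference, the model-specific route mentioned above does work on the saved folder, since the concrete class needs no config lookup to know what to instantiate:

from transformers import RobertaTokenizer

# The concrete class skips the AutoConfig lookup entirely,
# so it loads from the saved tokenizer files alone:
tokenizer = RobertaTokenizer.from_pretrained('TEST/tokenizer')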

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Oct 28, 2020

Hello! Indeed, I wouldn’t say this is a bug but more of a limitation of the AutoTokenizer class, which has to rely on the model configuration in order to guess which tokenizer is affiliated with the model. Since you’re not interacting with the configuration anywhere here, and therefore are not saving the model configuration in TEST/tokenizer, the AutoTokenizer cannot guess which tokenizer to load.

One way to work around this limitation is to specify the configuration when loading the tokenizer the second time:

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
tokenizer.save_pretrained('TEST/tokenizer')

tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer', config=AutoConfig.from_pretrained("roberta-base"))
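Passing the config object explicitly gives AutoTokenizer the model type it needs up front, so no config.json has to exist in TEST/tokenizer.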

Another way would be to save the configuration in the initial folder:

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
config = AutoConfig.from_pretrained('roberta-base')

tokenizer.save_pretrained('TEST/tokenizer')
config.save_pretrained('TEST/tokenizer')

tokenizer = AutoTokenizer.from_pretrained('TEST/tokenizer')
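To verify, a quick directory listing (a sketch; the exact file names depend on the transformers version) should now show config.json alongside the tokenizer files:

import os

# config.save_pretrained wrote config.json here, which is what
# AutoTokenizer needs in order to resolve the tokenizer class:
print(sorted(os.listdir('TEST/tokenizer')))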

In any case, the documentation about this should be improved.

0 reactions
stale[bot] commented, Dec 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
