'use_fast=True' results in 'TypeError' when trying to save tokenizer via AutoTokenizer
🐛 Bug
Information
Model I am using: all/any
Language I am using the model on: English
The problem arises when using:
- the official example scripts: AutoTokenizer.from_pretrained([model], use_fast=True)
After updating to transformers v2.10.0, setting use_fast=True as in tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True) and then trying to save the tokenizer with tokenizer.save_pretrained(path), I get the following error and the process quits:
File "../python3.6/site-packages/transformers/tokenization_utils.py", line 1117,
in save_pretrained
vocab_files = self.save_vocabulary(save_directory)
File "../python3.6/site-packages/transformers/tokenization_utils.py", line 2657,
in save_vocabulary
files = self._tokenizer.save(save_directory)
File "../python3.6/site-packages/tokenizers/implementations/base_tokenizer.py",
line 328, in save
return self._tokenizer.model.save(directory, name=name)
TypeError
When I omit the use_fast=True flag, the tokenizer saves fine.
The task I am working on is:
- my own task or dataset: Text classification
To reproduce
Steps to reproduce the behavior:
- Upgrade to transformers==2.10.0 (requires tokenizers==0.7.0)
- Load a tokenizer using AutoTokenizer.from_pretrained() with the flag use_fast=True
- Train for one epoch on any dataset, then try to save the tokenizer (a minimal sketch of the failing save follows the list).
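For concreteness, here is a minimal sketch of the load-and-save steps. The training step is omitted on the assumption that it does not matter, since the traceback above shows the failure happening at save time; the output/model_out path and the use of pathlib.Path are taken from the comments below:

```python
from pathlib import Path

from transformers import AutoTokenizer

# Step 2: load the Rust-backed "fast" tokenizer
# (transformers==2.10.0, tokenizers==0.7.0).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Step 3 (training omitted): save to a pathlib.Path, as in the comments below.
output_dir = Path("output/model_out")
output_dir.mkdir(parents=True, exist_ok=True)
tokenizer.save_pretrained(output_dir)  # raises TypeError with use_fast=True
```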
Expected behavior
The tokenizer files should save into the chosen path, as they do with the regular tokenizer.
Environment info
- transformers version: 2.10.0
- Platform: Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.8
- PyTorch version (GPU?): 1.4.0+cu100 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
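For anyone reproducing this, the version pins above can be sanity-checked from Python (a small sketch, assuming torch and tokenizers are importable; the template itself is typically produced by transformers-cli env):

```python
# Quick check of the pinned versions from the environment section above.
import tokenizers
import torch
import transformers

print("transformers:", transformers.__version__)  # expected: 2.10.0
print("tokenizers:", tokenizers.__version__)      # expected: 0.7.0
print("torch:", torch.__version__, "CUDA:", torch.cuda.is_available())
```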
Top GitHub Comments
The path I am trying to save to is "output/model_out" - but it's generated using Path(), in case that makes a difference (not sure why it would make a difference for saving the fast tokenizer and not the regular one though).

For what it is worth, I have this exact same issue with the "distilroberta-base" tokenizer when use_fast=True. Casting my Path object to a str solved the issue.
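Based on that last comment, a minimal sketch of the workaround, assuming the same output/model_out path as above: cast the pathlib.Path to str before calling save_pretrained:

```python
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)

output_dir = Path("output/model_out")
output_dir.mkdir(parents=True, exist_ok=True)

# Workaround from the comments: pass a plain str, since the fast tokenizer's
# save path appears to reject pathlib.Path objects in transformers 2.10.0.
tokenizer.save_pretrained(str(output_dir))
```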