'use_fast=True' results in 'TypeError' when trying to save tokenizer via AutoTokenizer
🐛 Bug
Information
Model I am using: all/any
Language I am using the model on: English
The problem arises when using:
- the official example scripts: AutoTokenizer.from_pretrained([model], use_fast=True)
After updating to transformers v2.10.0, setting use_fast=True as in tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True) and then trying to save the tokenizer with tokenizer.save_pretrained(path), I get the following error and the process quits:
File "../python3.6/site-packages/transformers/tokenization_utils.py", line 1117,
in save_pretrained
vocab_files = self.save_vocabulary(save_directory)
File "../python3.6/site-packages/transformers/tokenization_utils.py", line 2657,
in save_vocabulary
files = self._tokenizer.save(save_directory)
File "../python3.6/site-packages/tokenizers/implementations/base_tokenizer.py",
line 328, in save
return self._tokenizer.model.save(directory, name=name)
TypeError
When I omit the use_fast=True flag, the tokenizer saves fine.
The task I am working on is:
- my own task or dataset: Text classification
To reproduce
Steps to reproduce the behavior:
- Upgrade to transformers==2.10.0 (requires tokenizers==0.7.0)
- Load a tokenizer using AutoTokenizer.from_pretrained() with the flag use_fast=True
- Train for one epoch on any dataset, then try to save the tokenizer (a minimal sketch of the failing save follows the list).
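For concreteness, here is a minimal sketch of the load-and-save steps. The training step is omitted on the assumption that it does not matter, since the traceback above shows the failure happening at save time; the output/model_out path and the use of pathlib.Path are taken from the comments below:

```python
from pathlib import Path

from transformers import AutoTokenizer

# Step 2: load the Rust-backed "fast" tokenizer
# (transformers==2.10.0, tokenizers==0.7.0).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

# Step 3 (training omitted): save to a pathlib.Path, as in the comments below.
output_dir = Path("output/model_out")
output_dir.mkdir(parents=True, exist_ok=True)
tokenizer.save_pretrained(output_dir)  # raises TypeError with use_fast=True
```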
Expected behavior
The tokenizer files should save into the chosen path, as they do with the regular tokenizer.
Environment info
- transformers version: 2.10.0
- Platform: Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.8
- PyTorch version (GPU?): 1.4.0+cu100 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
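For anyone reproducing this, the version pins above can be sanity-checked from Python (a small sketch, assuming torch and tokenizers are importable; the template itself is typically produced by transformers-cli env):

```python
# Quick check of the pinned versions from the environment section above.
import tokenizers
import torch
import transformers

print("transformers:", transformers.__version__)  # expected: 2.10.0
print("tokenizers:", tokenizers.__version__)      # expected: 0.7.0
print("torch:", torch.__version__, "CUDA:", torch.cuda.is_available())
```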
Top GitHub Comments
The path I am trying to save to is "output/model_out" - but it's generated using Path(), in case that makes a difference (not sure why it would make a difference for saving the fast tokenizer and not the regular one though).

For what it is worth, I have this exact same issue with the "distilroberta-base" tokenizer when use_fast=True. Casting my Path object to a str solved the issue.
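Based on that last comment, a minimal sketch of the workaround, assuming the same output/model_out path as above: cast the pathlib.Path to str before calling save_pretrained:

```python
from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)

output_dir = Path("output/model_out")
output_dir.mkdir(parents=True, exist_ok=True)

# Workaround from the comments: pass a plain str, since the fast tokenizer's
# save path appears to reject pathlib.Path objects in transformers 2.10.0.
tokenizer.save_pretrained(str(output_dir))
```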