
'use_fast=True' results in 'TypeError' when trying to save tokenizer via AutoTokenizer

See original GitHub issue

🐛 Bug

Information

Model I am using: all/any

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: AutoTokenizer.from_pretrained([model], use_fast=True)

After updating to Transformers v2.10.0, when I create a tokenizer with use_fast=True, as in tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True), calling tokenizer.save_pretrained(path) raises the following error and the process quits:

File "../python3.6/site-packages/transformers/tokenization_utils.py", line 1117,
  in save_pretrained
    vocab_files = self.save_vocabulary(save_directory)
  File "../python3.6/site-packages/transformers/tokenization_utils.py", line 2657,
  in save_vocabulary
    files = self._tokenizer.save(save_directory)
  File "../python3.6/site-packages/tokenizers/implementations/base_tokenizer.py",
  line 328, in save
    return self._tokenizer.model.save(directory, name=name)
TypeError

When I omit the use_fast=True flag, the tokenizer saves fine.

The task I am working on is:

  • my own task or dataset: Text classification

To reproduce

Steps to reproduce the behavior:

  1. Upgrade to transformers==2.10.0 (requires tokenizers==0.7.0)
  2. Load a tokenizer using AutoTokenizer.from_pretrained() with flag use_fast=True
  3. Train for one epoch on any dataset, then try to save the tokenizer (see the minimal script below).
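
For reference, here is a minimal script for the failing case. This is a sketch under the versions above (transformers==2.10.0, tokenizers==0.7.0); the output directory is hypothetical, and training is omitted since the comments below suggest the save call alone triggers the error when the path is a pathlib.Path:

    from pathlib import Path

    from transformers import AutoTokenizer

    # Load the fast (Rust-backed) tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

    # Hypothetical output directory, built with pathlib.
    save_dir = Path("output/model_out")
    save_dir.mkdir(parents=True, exist_ok=True)

    # With transformers==2.10.0 this raises TypeError; the slow tokenizer
    # (use_fast=False) saves to the same path without error.
    tokenizer.save_pretrained(save_dir)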

Expected behavior

The tokenizer files should be saved to the chosen path, as they are with the regular (slow) tokenizer.

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.8
  • PyTorch version (GPU?): 1.4.0+cu100 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
lingdoc commented, May 25, 2020

The path I am trying to save to is "output/model_out", but it is generated using Path(), in case that makes a difference (not sure why it would matter for saving the fast tokenizer and not the regular one, though).

0 reactions
JohnGiorgi commented, Jun 25, 2020

For what it is worth, I have this exact same issue with the "distilroberta-base" tokenizer when use_fast=True. Casting my Path object to a str solved the issue.
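
A minimal sketch of that workaround, under the same version assumptions: cast the pathlib.Path to str before handing it to save_pretrained, so the underlying tokenizers library receives a plain string.

    from pathlib import Path

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True)

    save_dir = Path("output/model_out")
    save_dir.mkdir(parents=True, exist_ok=True)

    # Workaround: pass a str rather than a Path to save_pretrained.
    tokenizer.save_pretrained(str(save_dir))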

