Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Cannot instantiate Tokenizer

See original GitHub issue

I am using Huggingface Transformers 4.0.0. When I instantiate the autotokenizer for indicbert, I get the following issue:

My code: tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')

Error: Couldn’t instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Issue Analytics

State:
Created 3 years ago
Comments:12 (4 by maintainers)

Top GitHub Comments

7reactions

denocriscommented, Dec 16, 2020

Here the solution, sorry!

https://github.com/huggingface/transformers/releases/tag/v4.0.0

We must put use_fast=False in the tokenizer!

Thanks again!

6reactions

LauraIstcommented, Jan 7, 2021

Hi, I’m also having this problem. Trying to instantiate tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl", use_fast=False) but I get “ValueError: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer.”

But I have already installed sentencepiece. I have:

pip: 20.3.3
sentencepiece: 0.1.94
transformers: 4.1.1

The above code snippet with “Musixmatch/umberto-wikipedia-uncased-v1” also doesn’t work for me.

Anyone have more ideas?

Top Results From Across the Web

Error with new tokenizers (URGENT!) - Hugging Face Forums

Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3)...

python | ValueError: Couldn't instantiate the backend tokenizer

First, I had to pip install sentencepiece . However, in the same code line, I was getting an error with sentencepiece .

tf.keras.preprocessing.text.Tokenizer | TensorFlow v2.11.0

Returns the tokenizer configuration as Python dictionary. The word count dictionaries used by the tokenizer get serialized into plain JSON, so ...

tokenize — Tokenizer for Python source — Python 3.11.1 ...

The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens...

Tokenizers | Apache Solr Reference Guide 6.6

The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the ...