
Cannot load reformer-enwik8 tokenizer

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): Reformer (google/reformer-enwik8) tokenizer

To reproduce

Steps to reproduce the behavior:

  1. Try to load the pretrained reformer-enwik8 tokenizer with AutoTokenizer.from_pretrained("google/reformer-enwik8")
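
In code, that is simply:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-enwik8")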

This is the error I got:

OSError                                   Traceback (most recent call last)
<ipython-input-51-ab9a64363cc0> in <module>
----> 1 AutoTokenizer.from_pretrained("google/reformer-enwik8")

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    198                     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    199                 else:
--> 200                     return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    201 
    202         raise ValueError(

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in from_pretrained(cls, *inputs, **kwargs)
    896 
    897         """
--> 898         return cls._from_pretrained(*inputs, **kwargs)
    899 
    900     @classmethod

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1001                     ", ".join(s3_models),
   1002                     pretrained_model_name_or_path,
-> 1003                     list(cls.vocab_files_names.values()),
   1004                 )
   1005             )

OSError: Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

I tried with and without the google/ prefix, with the same result. The call did show a download progress bar, though. Loading the crime-and-punishment Reformer tokenizer works fine.

  • transformers version: 2.9.0
  • Platform: macOS
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4.0, no GPU
  • Using distributed or parallel set-up in script?: no
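
For context, the likely cause is that google/reformer-enwik8 is a character/byte-level language model that ships no vocabulary file at all, so there is no tokenizer artifact for from_pretrained() to download. The model card suggests encoding raw bytes by hand instead of using a tokenizer. The sketch below follows that convention as I understand it (each byte value shifted by 2, with IDs 0 and 1 reserved); treat it as an illustration rather than the canonical snippet:

import torch

# Map strings to byte-level input IDs: byte value + 2, reserving IDs 0
# (padding) and 1, per the convention on the reformer-enwik8 model card.
def encode(list_of_strings, pad_token_id=0):
    max_length = max(len(s) for s in list_of_strings)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    for idx, string in enumerate(list_of_strings):
        if not isinstance(string, bytes):
            string = str.encode(string)
        input_ids[idx, : len(string)] = torch.tensor([b + 2 for b in string])
        attention_masks[idx, : len(string)] = 1
    return input_ids, attention_masks

# Map output IDs back to text; IDs below 2 decode to the empty string.
def decode(outputs_ids):
    return [
        "".join(chr(x - 2) if x > 1 else "" for x in output_ids)
        for output_ids in outputs_ids.tolist()
    ]

input_ids, attention_masks = encode(["This is a sentence."])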

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5

Top GitHub Comments

2 reactions
bratao commented, May 24, 2020

@erickrf can you share how you managed to train the “reformer” model? I’m trying to use “google/reformer-enwik8” to train a Portuguese model, but I get the same error: Model name 'google/reformer-enwik8' was not found in tokenizers

0 reactions
BramVanroy commented, Mar 10, 2021

@LeopoldACC Please post a new issue so that someone can have a look.

Read more comments on GitHub >

Top Results From Across the Web

Cannot load saved tokenizer using AutoTokenizer · Issue #8125
It appears that you can save a tokenizer to disk in a model agnostic way, but you cannot load it back in...

Hugging face tokenizer cannot load files properly
There is some error in huggingface code so i loaded the tokenizer like this and it worked. tokenizer = ByteLevelBPETokenizer('tokens/vocab.json' ...

Utilities for Tokenizers - Hugging Face
When the tokenizer is loaded with from_pretrained(), this will be set to the value stored for the associated model in max_model_input_sizes (see above)...

Understanding External Tokenization
External Tokenization enables accounts to tokenize data before loading it into Snowflake and ... Cannot apply a masking policy to a Snowflake feature...

Tokenizer reference | Elasticsearch Guide [8.5] | Elastic
For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into...
