
Cannot load reformer-enwik8 tokenizer

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): Reformer (google/reformer-enwik8) tokenizer

To reproduce

Steps to reproduce the behavior:

  1. Try to load the pretrained reformer-enwik8 tokenizer with AutoTokenizer.from_pretrained("google/reformer-enwik8")
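
In code, that is simply:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/reformer-enwik8")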

This is the error I got:

OSError                                   Traceback (most recent call last)
<ipython-input-51-ab9a64363cc0> in <module>
----> 1 AutoTokenizer.from_pretrained("google/reformer-enwik8")

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    198                     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    199                 else:
--> 200                     return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    201 
    202         raise ValueError(

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in from_pretrained(cls, *inputs, **kwargs)
    896 
    897         """
--> 898         return cls._from_pretrained(*inputs, **kwargs)
    899 
    900     @classmethod

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1001                     ", ".join(s3_models),
   1002                     pretrained_model_name_or_path,
-> 1003                     list(cls.vocab_files_names.values()),
   1004                 )
   1005             )

OSError: Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

I tried with and without the google/ prefix, with the same result. The call did show a download progress bar, though. Loading the crime-and-punishment Reformer tokenizer works fine.

  • transformers version: 2.9.0
  • Platform: macOS
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4.0, no GPU
  • Using distributed or parallel set-up in script?: no
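
For context, the likely cause is that google/reformer-enwik8 is a character/byte-level language model that ships no vocabulary file at all, so there is no tokenizer artifact for from_pretrained() to download. The model card suggests encoding raw bytes by hand instead of using a tokenizer. The sketch below follows that convention as I understand it (each byte value shifted by 2, with IDs 0 and 1 reserved); treat it as an illustration rather than the canonical snippet:

import torch

# Map strings to byte-level input IDs: byte value + 2, reserving IDs 0
# (padding) and 1, per the convention on the reformer-enwik8 model card.
def encode(list_of_strings, pad_token_id=0):
    max_length = max(len(s) for s in list_of_strings)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    for idx, string in enumerate(list_of_strings):
        if not isinstance(string, bytes):
            string = str.encode(string)
        input_ids[idx, : len(string)] = torch.tensor([b + 2 for b in string])
        attention_masks[idx, : len(string)] = 1
    return input_ids, attention_masks

# Map output IDs back to text; IDs below 2 decode to the empty string.
def decode(outputs_ids):
    return [
        "".join(chr(x - 2) if x > 1 else "" for x in output_ids)
        for output_ids in outputs_ids.tolist()
    ]

input_ids, attention_masks = encode(["This is a sentence."])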

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5

Top GitHub Comments

2 reactions
bratao commented, May 24, 2020

@erickrf can you share how you managed to train the “reformer” model? I’m trying to use “google/reformer-enwik8” to train a Portuguese model, but I get the same error: Model name 'google/reformer-enwik8' was not found in tokenizers

0 reactions
BramVanroy commented, Mar 10, 2021

@LeopoldACC Please post a new issue so that someone can have a look.

Read more comments on GitHub >

Top Results From Across the Web

Cannot load saved tokenizer using AutoTokenizer · Issue #8125
It appears that you can save a tokenizer to disk in a model agnostic way, but you cannot load it back in...

Hugging face tokenizer cannot load files properly
There is some error in huggingface code so i loaded the tokenizer like this and it worked. tokenizer = ByteLevelBPETokenizer('tokens/vocab.json' ...

Utilities for Tokenizers - Hugging Face
When the tokenizer is loaded with from_pretrained(), this will be set to the value stored for the associated model in max_model_input_sizes (see above)...

Understanding External Tokenization
External Tokenization enables accounts to tokenize data before loading it into Snowflake and ... Cannot apply a masking policy to a Snowflake feature...

Tokenizer reference | Elasticsearch Guide [8.5] | Elastic
For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into...
