Cannot load reformer-enwik8 tokenizer
🐛 Bug
Information
Model I am using (Bert, XLNet …): Reformer tokenizer
To reproduce
Steps to reproduce the behavior:
- Try to load the pretrained reformer-enwik8 tokenizer with AutoTokenizer.from_pretrained("google/reformer-enwik8"), as in the snippet below.
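A minimal reproduction, assuming transformers 2.9.0 as reported in the environment section:

from transformers import AutoTokenizer

# Raises OSError: 'google/reformer-enwik8' is not in the tokenizer model
# name list, and no spiece.model vocabulary is found for this identifier.
tokenizer = AutoTokenizer.from_pretrained("google/reformer-enwik8")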
This is the error I got:
OSError Traceback (most recent call last)
<ipython-input-51-ab9a64363cc0> in <module>
----> 1 AutoTokenizer.from_pretrained("google/reformer-enwik8")
~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
198 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
199 else:
--> 200 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
201
202 raise ValueError(
~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in from_pretrained(cls, *inputs, **kwargs)
896
897 """
--> 898 return cls._from_pretrained(*inputs, **kwargs)
899
900 @classmethod
~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1001 ", ".join(s3_models),
1002 pretrained_model_name_or_path,
-> 1003 list(cls.vocab_files_names.values()),
1004 )
1005 )
OSError: Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
I tried with and without the google/ prefix, with the same result. However, it did print the download progress bar. Loading the crime-and-punishment Reformer tokenizer works.
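For comparison, the crime-and-punishment checkpoint loads without error, which suggests it ships the spiece.model vocabulary file that the error message above says is expected:

from transformers import AutoTokenizer

# Works: this checkpoint provides the expected vocabulary files.
tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")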
Environment info
- transformers version: 2.9.0
- Platform: macOS
- Python version: 3.7
- PyTorch version (GPU?): 1.4.0, no GPU
- Using distributed or parallel set-up in script?: no
Top GitHub Comments
@erickrf Can you share how you managed to train the Reformer model? I'm trying to use "google/reformer-enwik8" to train a Portuguese model, but I got the same error:
Model name 'google/reformer-enwik8' was not found in tokenizers
@LeopoldACC Please post a new issue so that someone can have a look.
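One possible workaround, assuming the enwik8 checkpoint operates directly on raw bytes/characters and simply ships no tokenizer files (which would explain the missing spiece.model): encode the text manually instead of loading a tokenizer. The encode helper below is a hypothetical sketch, not an official transformers API; in particular, the +2 ID offset for reserved tokens is an assumption.

import torch

def encode(strings, pad_token_id=0):
    # Hypothetical byte-level encoder: each UTF-8 byte becomes an ID,
    # offset by 2 to keep the lowest IDs free for padding/special tokens
    # (this offset is an assumption, not confirmed by the source issue).
    byte_strings = [s.encode("utf-8") for s in strings]
    max_length = max(len(b) for b in byte_strings)
    input_ids = torch.full((len(strings), max_length), pad_token_id, dtype=torch.long)
    attention_masks = torch.zeros((len(strings), max_length), dtype=torch.long)
    for i, raw in enumerate(byte_strings):
        input_ids[i, :len(raw)] = torch.tensor([b + 2 for b in raw])
        attention_masks[i, :len(raw)] = 1
    return input_ids, attention_masks

# Example usage of the sketch above:
input_ids, attention_masks = encode(["hello world"])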