AutoTokenizer not loading gpt2 model on instance without internet connection even after caching model
I am trying to first download and cache the GPT2 tokenizer for use on an instance that has no internet connection. I can download the tokenizer on my EC2 instance that does have a connection, but when I copy the cache directory over to the instance without a connection, loading it fails with a connection error.
The issue seems to affect only the tokenizer, not the model.
Environment info
- `transformers` version: 4.8.1
- Platform: Linux-4.14.232-176.381.amzn2.x86_64-x86_64-with-glibc2.9
- Python version: 3.6.10
- PyTorch version (GPU?): 1.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Models:
- gpt2: @patrickvonplaten, @LysandreJik
Information
Tokenizer/Model I am using (GPT2, microsoft/DialogRPT-updown):
The problem arises when using:
- the official example scripts: (give details below)
The tasks I am working on is:
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- On my EC2 instance that has an internet connection, I run:

```python
from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")
```

- On my EC2 instance that does not have an internet connection, I run the same commands:

```python
from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1680, in from_pretrained
    user_agent=user_agent,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1337, in cached_path
    local_files_only=local_files_only,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1553, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
```
It also fails with AutoTokenizer.
Expected behavior
After doing some digging, it turns out the lookup is failing on the added_tokens_file, which does not exist on disk. The vocab_file does exist.
Top GitHub Comments
@searchivarius `local_files_only` should indeed work. You can add it to your from_pretrained calls; that's the very hands-on, manual way to do this for each of your model, config, and tokenizer inits. You can also set this globally. See https://github.com/huggingface/transformers/blob/master/docs/source/installation.md#offline-mode
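A minimal sketch of both approaches, assuming the files were already downloaded into the cache directory while online (the `from_pretrained` call itself is shown commented out, since it needs those files on disk; `<some_directory>` is the placeholder path from the report above):

```python
import os

# Option 1: the per-call flag -- pass local_files_only=True to each
# from_pretrained call (tokenizer, config, and model alike):
#
#   from transformers import GPT2Tokenizer
#   tokenizer = GPT2Tokenizer.from_pretrained(
#       "gpt2", cache_dir="<some_directory>", local_files_only=True
#   )

# Option 2: the global switch -- set TRANSFORMERS_OFFLINE before
# transformers is imported; every from_pretrained call then behaves
# as if local_files_only=True had been passed.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

Option 2 is usually the more convenient one on an air-gapped machine, since it needs no code changes.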
Hi everybody, I am getting the same error. After digging a bit deeper, I believe the current caching mechanism crucially depends on an internet connection for the latest versions, e.g., 4.8.x and 4.9.2. I blame the function `get_from_cache`, which IMHO cannot work properly unless you always have internet. Some details are below.
First, specifying the caching directory doesn't help, because the function `get_from_cache` computes the caching path using the so-called etag. I added code to print the filename, the URL, and the etag. When the internet is there, we get:
Then, I disconnect the internet. Now the files are cached and should be accessible just fine.
So, we retry creating the tokenizer again, but it fails because without the etag we generate a very different filename:
The function `get_from_cache` has the parameter local_files_only. When it's true, the etag is not computed. However, it is not clear how to use this to enable offline creation of resources after they have been downloaded once. Thank you!
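To make the filename mismatch concrete, here is a simplified sketch of the naming scheme `get_from_cache` relies on, modeled on `url_to_filename` in `transformers/file_utils.py` (this is my own stripped-down copy for illustration, not the library code, and the etag value is made up):

```python
from hashlib import sha256

def url_to_filename(url: str, etag: str = None) -> str:
    # The cache filename is sha256(url), with ".sha256(etag)" appended
    # when an etag is available, i.e., when a HEAD request to the hub
    # succeeded.
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename

url = "https://huggingface.co/gpt2/resolve/main/vocab.json"

# Online: the etag returned by the HEAD request becomes part of the name
# the file is cached under.
with_etag = url_to_filename(url, etag='"dummy-etag"')

# Offline: no etag can be fetched, so the lookup computes a shorter,
# different name and misses the file that was cached while online.
without_etag = url_to_filename(url)

assert with_etag != without_etag
```

This is why `local_files_only=True` matters: it skips the etag lookup on both the write and the read path, so the names match.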