
AutoTokenizer not loading gpt2 model on instance without internet connection even after caching model

See original GitHub issue

I am trying to download and cache the GPT2 tokenizer so that I can use it on an instance that has no internet connection. I am able to download the tokenizer on my EC2 instance that does have an internet connection, but when I copy the directory over to the instance without one, I get a connection error.

The issue seems to affect only the tokenizer, not the model.
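
For context, the workaround most often suggested for this setup (a sketch, not something stated in this thread) is to serialize the tokenizer with save_pretrained on the connected machine, copy the resulting directory across, and load it by local path, which bypasses the URL-keyed cache entirely:

# On the machine with internet access (the path is illustrative):
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.save_pretrained("/tmp/gpt2-tokenizer")  # writes vocab.json, merges.txt, tokenizer configs

# After copying /tmp/gpt2-tokenizer to the offline machine:
tok = GPT2Tokenizer.from_pretrained("/tmp/gpt2-tokenizer")  # reads local files, no network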

Environment info

  • transformers version: 4.8.1
  • Platform: Linux-4.14.232-176.381.amzn2.x86_64-x86_64-with-glibc2.9
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Models:

Information

Tokenizer/Model I am using (GPT2, microsoft/DialogRPT-updown):

The problem arises when using:

  • the official example scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. On my EC2 instance that has an internet connection, I run:

from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")

  2. On my EC2 instance that does not have an internet connection, I run the same command:

from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1680, in from_pretrained
    user_agent=user_agent,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1337, in cached_path
    local_files_only=local_files_only,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1553, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

This also does not work with AutoTokenizer.

Expected behavior

After doing some digging, it appears the tokenizer is looking for the added_tokens_file, which does not exist in the cache; the vocab_file does exist.
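
For anyone reproducing this, one way to see exactly which files landed in the cache is to read the .json sidecar files that this version's caching layer writes next to each blob (a sketch; <some_directory> is the cache_dir used above):

import json
import os

cache_dir = "<some_directory>"  # the cache_dir passed to from_pretrained
for name in os.listdir(cache_dir):
    if name.endswith(".json"):
        # each cached blob has a sidecar JSON recording the source url and etag
        with open(os.path.join(cache_dir, name)) as f:
            meta = json.load(f)
        print(meta["url"], "->", name[: -len(".json")])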

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

1 reaction
BramVanroy commented, Aug 23, 2021

@searchivarius local_files_only should indeed work. You can add it to your from_pretrained calls, e.g.

tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>', local_files_only=True)

That’s the very hands-on, manual way to do this for each of your model, config, and tokenizer inits. You can also set this globally; see https://github.com/huggingface/transformers/blob/master/docs/source/installation.md#offline-mode
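
The global switch in that doc is an environment variable; roughly (TRANSFORMERS_OFFLINE is the flag relevant here, and it must be set before transformers is imported, since the library reads it at import time):

import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # or: export TRANSFORMERS_OFFLINE=1 in the shell

from transformers import AutoTokenizer
# with the flag set, from_pretrained behaves as if local_files_only=True
tok = AutoTokenizer.from_pretrained("roberta-base")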

1 reaction
searchivarius commented, Aug 23, 2021

Hi everybody, I am getting the same error, and after digging a bit deeper, I believe that the current caching mechanism depends crucially on the Internet connection in recent versions, e.g., 4.8.x and 4.9.2. I blame the function get_from_cache, which IMHO cannot work properly unless you always have Internet. Some details are below.

Simple code to reproduce the effect:

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>')

First, specifying the caching directory doesn’t help, because the function get_from_cache computes the caching path using the so-called etag:

filename = url_to_filename(url, etag)

I added code to print the filename, the URL, and the etag. When the Internet connection is up, we get:

### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
### url: https://huggingface.co/roberta-base/resolve/main/vocab.json etag: "5606f48548d99a9829d10a96cd364b816b02cd21" filename: d3ccdbfeb9aaa747ef20432d4976c32ee3fa69663b379deb253ccfce2bb1fdc5.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
### url: https://huggingface.co/roberta-base/resolve/main/merges.txt etag: "226b0752cac7789c48f0cb3ec53eda48b7be36cc" filename: cafdecc90fcab17011e12ac813dd574b4b3fea39da6dd817813efa010262ff3f.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
### url: https://huggingface.co/roberta-base/resolve/main/tokenizer.json etag: "ad0bcbeb288f0d1373d88e0762e66357f55b8311" filename: d53fc0fa09b8342651efd4073d75e19617b3e51287c2a535becda5808a8db287.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b

Then I disconnect the Internet. At this point the files are cached and should be accessible just fine.

So we retry creating the tokenizer, but it fails, because without the etag we generate a very different filename:

### url: https://huggingface.co/roberta-base/resolve/main/tokenizer_config.json etag: None filename: dfe8f1ad04cb25b61a647e3d13620f9bf0a0f51d277897b232a5735297134132
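
A minimal sketch of why the names diverge, mirroring what url_to_filename in transformers/file_utils.py does (assuming the sha256-based scheme, which matches the 64-character hex digests printed above):

from hashlib import sha256

def url_to_filename(url, etag=None):
    # the cache key is a hash of the url; the etag, when known, is appended as a suffix
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename

url = "https://huggingface.co/roberta-base/resolve/main/config.json"
print(url_to_filename(url, etag='"8db5e7ac5bfc9ec8b613b776009300fe3685d957"'))  # online: url-hash.etag-hash
print(url_to_filename(url))  # offline (etag is None): url-hash only, so the cached file is never found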

The function get_from_cache has the parameter local_files_only. When it’s true, the etag is not computed. However, it is not clear how to use this to enable offline creation of resources after they have been downloaded once.

Thank you!
