
AutoTokenizer not loading gpt2 model on instance without internet connection even after caching model

See original GitHub issue

I am trying to download and cache the GPT2 tokenizer so that I can use it on an instance that has no internet connection. I am able to download the tokenizer on my EC2 instance that does have an internet connection, but when I copy the directory over to the instance without one, I get a connection error.

The issue seems to affect only the tokenizer, not the model.
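
For context, the workaround most often suggested for this setup (a sketch, not something stated in this thread) is to serialize the tokenizer with save_pretrained on the connected machine, copy the resulting directory across, and load it by local path, which bypasses the URL-keyed cache entirely:

# On the machine with internet access (the path is illustrative):
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.save_pretrained("/tmp/gpt2-tokenizer")  # writes vocab.json, merges.txt, tokenizer configs

# After copying /tmp/gpt2-tokenizer to the offline machine:
tok = GPT2Tokenizer.from_pretrained("/tmp/gpt2-tokenizer")  # reads local files, no network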

Environment info

  • transformers version: 4.8.1
  • Platform: Linux-4.14.232-176.381.amzn2.x86_64-x86_64-with-glibc2.9
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.4.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Models:

Information

Tokenizer/Model I am using (GPT2, microsoft/DialogRPT-updown):

The problem arises when using:

  • the official example scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. On my EC2 instance that has an internet connection, I run:

from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")

  2. On my EC2 instance that does not have an internet connection, I run the same command:

from transformers import GPT2Tokenizer
GPT2Tokenizer.from_pretrained("gpt2", cache_dir="<some_directory>")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1680, in from_pretrained
    user_agent=user_agent,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1337, in cached_path
    local_files_only=local_files_only,
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/file_utils.py", line 1553, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

This also does not work with AutoTokenizer.

Expected behavior

After doing some digging, it appears the tokenizer is looking for the added_tokens_file, which does not exist in the cache; the vocab_file does exist.
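
For anyone reproducing this, one way to see exactly which files landed in the cache is to read the .json sidecar files that this version's caching layer writes next to each blob (a sketch; <some_directory> is the cache_dir used above):

import json
import os

cache_dir = "<some_directory>"  # the cache_dir passed to from_pretrained
for name in os.listdir(cache_dir):
    if name.endswith(".json"):
        # each cached blob has a sidecar JSON recording the source url and etag
        with open(os.path.join(cache_dir, name)) as f:
            meta = json.load(f)
        print(meta["url"], "->", name[: -len(".json")])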

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

1 reaction
BramVanroy commented, Aug 23, 2021

@searchivarius local_files_only should indeed work. You can add it to your from_pretrained calls, e.g.

tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>', local_files_only=True)

That’s the very hands-on, manual way to do this for each of your model, config, and tokenizer inits. You can also set this globally; see https://github.com/huggingface/transformers/blob/master/docs/source/installation.md#offline-mode
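
The global switch in that doc is an environment variable; roughly (TRANSFORMERS_OFFLINE is the flag relevant here, and it must be set before transformers is imported, since the library reads it at import time):

import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # or: export TRANSFORMERS_OFFLINE=1 in the shell

from transformers import AutoTokenizer
# with the flag set, from_pretrained behaves as if local_files_only=True
tok = AutoTokenizer.from_pretrained("roberta-base")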

1 reaction
searchivarius commented, Aug 23, 2021

Hi everybody, I am getting the same error, and after digging a bit deeper, I believe that the current caching mechanism depends crucially on the Internet connection in recent versions, e.g., 4.8.x and 4.9.2. I blame the function get_from_cache, which IMHO cannot work properly unless you always have Internet. Some details are below.

Simple code to reproduce the effect:

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained('roberta-base', unk_token='<unk>')

First, specifying the caching directory doesn’t help, because the function get_from_cache computes the caching path using the so-called etag:

filename = url_to_filename(url, etag)

I added code to print the filename, the URL, and the etag. When the Internet connection is up, we get:

### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
### url: https://huggingface.co/roberta-base/resolve/main/vocab.json etag: "5606f48548d99a9829d10a96cd364b816b02cd21" filename: d3ccdbfeb9aaa747ef20432d4976c32ee3fa69663b379deb253ccfce2bb1fdc5.d67d6b367eb24ab43b08ad55e014cf254076934f71d832bbab9ad35644a375ab
### url: https://huggingface.co/roberta-base/resolve/main/merges.txt etag: "226b0752cac7789c48f0cb3ec53eda48b7be36cc" filename: cafdecc90fcab17011e12ac813dd574b4b3fea39da6dd817813efa010262ff3f.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
### url: https://huggingface.co/roberta-base/resolve/main/tokenizer.json etag: "ad0bcbeb288f0d1373d88e0762e66357f55b8311" filename: d53fc0fa09b8342651efd4073d75e19617b3e51287c2a535becda5808a8db287.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730
### url: https://huggingface.co/roberta-base/resolve/main/config.json etag: "8db5e7ac5bfc9ec8b613b776009300fe3685d957" filename: 733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b

Then I disconnect the Internet. At this point the files are cached and should be accessible just fine.

So we retry creating the tokenizer, but it fails, because without the etag we generate a very different filename:

### url: https://huggingface.co/roberta-base/resolve/main/tokenizer_config.json etag: None filename: dfe8f1ad04cb25b61a647e3d13620f9bf0a0f51d277897b232a5735297134132
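
A minimal sketch of why the names diverge, mirroring what url_to_filename in transformers/file_utils.py does (assuming the sha256-based scheme, which matches the 64-character hex digests printed above):

from hashlib import sha256

def url_to_filename(url, etag=None):
    # the cache key is a hash of the url; the etag, when known, is appended as a suffix
    filename = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        filename += "." + sha256(etag.encode("utf-8")).hexdigest()
    return filename

url = "https://huggingface.co/roberta-base/resolve/main/config.json"
print(url_to_filename(url, etag='"8db5e7ac5bfc9ec8b613b776009300fe3685d957"'))  # online: url-hash.etag-hash
print(url_to_filename(url))  # offline (etag is None): url-hash only, so the cached file is never found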

The function get_from_cache has the parameter local_files_only. When it’s true, the etag is not computed. However, it is not clear how to use this to enable offline creation of resources after they have been downloaded once.

Thank you!
