
RagRetriever.from_pretrained doesn't respect a custom cache_dir.

See original GitHub issue

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.19
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0
  • Tensorflow version (GPU?): No
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@VictorSanh

Information

Model I am using: RAG

The problem arises when using:

  • the official example scripts: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Open notebook
  2. Run the example code, changing the TRANSFORMERS_CACHE path so that the dataset is placed somewhere other than the default location:

import os
os.environ['TRANSFORMERS_CACHE'] = '/workspace/notebooks/POCs/cache'

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Here the data is placed in the expected path /workspace/notebooks/POCs/cache
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# The dataset is still placed in the default location
# /root/.cache/huggingface/datasets/wiki_dpr/psgs_w100.nq.no_index/0.0.0/
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False)

Expected behavior

RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False) should place the data in the expected path '/workspace/notebooks/POCs/cache'.

I also tried retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", cache_dir='/workspace/notebooks/POCs/cache', use_dummy_dataset=False), but that doesn't work either.
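A likely explanation, confirmed by the maintainer reply further down the thread, is that the retrieval dataset is fetched through the datasets library, which reads HF_DATASETS_CACHE rather than TRANSFORMERS_CACHE. A minimal sketch of the workaround (the paths are illustrative; both variables should be set before transformers/datasets are imported, since they are read at import time):

```python
import os

# TRANSFORMERS_CACHE controls where model/tokenizer files are cached;
# HF_DATASETS_CACHE controls where the datasets library (used by
# RagRetriever for the wiki_dpr index) stores its data.
# Set both before importing transformers/datasets.
os.environ["TRANSFORMERS_CACHE"] = "/workspace/notebooks/POCs/cache"
os.environ["HF_DATASETS_CACHE"] = "/workspace/notebooks/POCs/cache"
```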

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:6 (3 by maintainers)

Top GitHub Comments

2 reactions
josemlopez commented, Oct 6, 2020

Hi @lhoestq,

"In the meantime you can specify HF_DATASETS_CACHE to tell where to store the dataset used by RAG for retrieval"

HF_DATASETS_CACHE works fine:

retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False)

Using custom data configuration psgs_w100.nq.no_index
Reusing dataset wiki_dpr (/my_cache/cache/wiki_dpr/psgs_w100.nq.no_index/0.0.0/14b973bf2a456087ff69c0fd34526684eed22e48e0dfce4338f9a22b965ce7c2)
Using custom data configuration psgs_w100.nq.exact

Downloading and preparing dataset wiki_dpr/psgs_w100.nq.exact (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /my_cache/cache/wiki_dpr/psgs_w100.nq.exact/0.0.0/14b973bf2a456087ff69c0fd34526684eed22e48e0dfce4338f9a22b965ce7c2...

"Could you create an issue on the datasets repo? This seems unrelated."

Sure, I'll post the other issue in the datasets repo.

Thanks!
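The behavior shown in the log above can be summarized as a simplified stand-in (the function resolve_datasets_cache and its exact lookup order are illustrative, not the library's actual code): an explicit cache_dir argument would win, then HF_DATASETS_CACHE, then the default under ~/.cache/huggingface/datasets.

```python
import os
from pathlib import Path

def resolve_datasets_cache(cache_dir=None):
    """Simplified mimic of how the datasets library picks its cache
    directory (illustrative, not the real implementation): an explicit
    cache_dir wins, then HF_DATASETS_CACHE, then the default."""
    if cache_dir is not None:
        return Path(cache_dir)
    env_dir = os.environ.get("HF_DATASETS_CACHE")
    if env_dir:
        return Path(env_dir)
    return Path.home() / ".cache" / "huggingface" / "datasets"

os.environ["HF_DATASETS_CACHE"] = "/my_cache/cache"
print(resolve_datasets_cache())             # -> /my_cache/cache
print(resolve_datasets_cache("/tmp/other")) # -> /tmp/other
```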

0 reactions
stale[bot] commented, Dec 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
