
RagRetriever.from_pretrained doesn't respect a custom cache_dir.

See original GitHub issue

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.19
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0
  • Tensorflow version (GPU?): No
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@VictorSanh

Information

Model I am using: RAG

The problem arises when using:

  • the official example scripts: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Open notebook
  2. Run the example code, changing the TRANSFORMERS_CACHE path so that the dataset is placed somewhere other than the default location:

import os
os.environ['TRANSFORMERS_CACHE'] = '/workspace/notebooks/POCs/cache'

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Here the data is placed in the expected path /workspace/notebooks/POCs/cache
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# The dataset is still placed in the default location
# /root/.cache/huggingface/datasets/wiki_dpr/psgs_w100.nq.no_index/0.0.0/
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False)

Expected behavior

RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False) should place the data in the expected path '/workspace/notebooks/POCs/cache'.

I also tried retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", cache_dir='/workspace/notebooks/POCs/cache', use_dummy_dataset=False), but that doesn't work either.
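A likely explanation, confirmed by the maintainer reply further down the thread, is that the retrieval dataset is fetched through the datasets library, which reads HF_DATASETS_CACHE rather than TRANSFORMERS_CACHE. A minimal sketch of the workaround (the paths are illustrative; both variables should be set before transformers/datasets are imported, since they are read at import time):

```python
import os

# TRANSFORMERS_CACHE controls where model/tokenizer files are cached;
# HF_DATASETS_CACHE controls where the datasets library (used by
# RagRetriever for the wiki_dpr index) stores its data.
# Set both before importing transformers/datasets.
os.environ["TRANSFORMERS_CACHE"] = "/workspace/notebooks/POCs/cache"
os.environ["HF_DATASETS_CACHE"] = "/workspace/notebooks/POCs/cache"
```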

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:6 (3 by maintainers)

Top GitHub Comments

2 reactions
josemlopez commented, Oct 6, 2020

Hi @lhoestq,

"In the meantime you can specify HF_DATASETS_CACHE to tell where to store the dataset used by RAG for retrieval"

HF_DATASETS_CACHE works fine:

retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=False)

Using custom data configuration psgs_w100.nq.no_index
Reusing dataset wiki_dpr (/my_cache/cache/wiki_dpr/psgs_w100.nq.no_index/0.0.0/14b973bf2a456087ff69c0fd34526684eed22e48e0dfce4338f9a22b965ce7c2)
Using custom data configuration psgs_w100.nq.exact

Downloading and preparing dataset wiki_dpr/psgs_w100.nq.exact (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /my_cache/cache/wiki_dpr/psgs_w100.nq.exact/0.0.0/14b973bf2a456087ff69c0fd34526684eed22e48e0dfce4338f9a22b965ce7c2...

"Could you create an issue on the datasets repo? This seems unrelated."

Sure, I'll post the other issue in the datasets repo.

Thanks!
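The behavior shown in the log above can be summarized as a simplified stand-in (the function resolve_datasets_cache and its exact lookup order are illustrative, not the library's actual code): an explicit cache_dir argument would win, then HF_DATASETS_CACHE, then the default under ~/.cache/huggingface/datasets.

```python
import os
from pathlib import Path

def resolve_datasets_cache(cache_dir=None):
    """Simplified mimic of how the datasets library picks its cache
    directory (illustrative, not the real implementation): an explicit
    cache_dir wins, then HF_DATASETS_CACHE, then the default."""
    if cache_dir is not None:
        return Path(cache_dir)
    env_dir = os.environ.get("HF_DATASETS_CACHE")
    if env_dir:
        return Path(env_dir)
    return Path.home() / ".cache" / "huggingface" / "datasets"

os.environ["HF_DATASETS_CACHE"] = "/my_cache/cache"
print(resolve_datasets_cache())             # -> /my_cache/cache
print(resolve_datasets_cache("/tmp/other")) # -> /tmp/other
```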

0 reactions
stale[bot] commented, Dec 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
