Unable to finetune RAG end2end due to error in finetune_rag.py file
Environment info
- transformers version: 4.8.2
- Platform: Ubuntu 20.04
- Python version: 3.8.10
- PyTorch version (GPU?): 1.10.0.dev20210717+cu111
- Tensorflow version (GPU?): none
- Using GPU in script?: yes, 8 A6000s
- Using distributed or parallel set-up in script?: yes, ray 2.0.0.dev0
Who can help
Quentin Lhoest (@lhoestq), Patrick von Platen (@patrickvonplaten)
Information
Model I am using (Bert, XLNet …): RAG
The problem arises when using:
- the official example scripts: I used the finetune_rag_ray_end2end.sh script, but it doesn't run in my environment with either ray or pytorch as the distributed retriever.
- my own modified scripts: I only modified the args passed to finetune_rag.py (a note on what the knowledge-base paths point at follows the argument list):
--data_dir squad-training \
--output_dir model_checkpoints \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--fp16 \
--gpus 8 \
--profile \
--do_train \
--end2end \
--do_predict \
--n_val -1 \
--train_batch_size 8 \
--eval_batch_size 1 \
--max_source_length 128 \
--max_target_length 25 \
--val_max_target_length 25 \
--test_max_target_length 25 \
--label_smoothing 0.1 \
--dropout 0.1 \
--attention_dropout 0.1 \
--weight_decay 0.001 \
--adam_epsilon 1e-08 \
--max_grad_norm 0.1 \
--lr_scheduler polynomial \
--learning_rate 3e-05 \
--num_train_epochs 10 \
--warmup_steps 500 \
--gradient_accumulation_steps 8 \
--distributed_retriever pytorch \
--num_retrieval_workers 4 \
--passages_path SQUAD-KB/my_knowledge_dataset \
--index_path SQUAD-KB/my_knowledge_dataset_hnsw_index.faiss \
--index_name custom \
--context_encoder_name facebook/dpr-ctx_encoder-multiset-base \
--csv_path SQUAD-KB/squad-kb.csv \
--index_gpus 1 \
--gpu_order [5,6,7,8,9,0,1,2,3,4] \
--shard_dir test_dir/kb-shards \
--indexing_freq 500
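For context, the --passages_path, --index_path, and --csv_path arguments above point at a datasets dataset saved with save_to_disk, a FAISS index saved separately, and the raw passages CSV. The downloaded SQUAD-KB data already ships these files, so the following is only a minimal sketch of how such a custom HNSW index is usually built with the datasets library; 768 is the DPR context-encoder embedding size, and the HNSW parameter 128 is an assumption, not taken from the repo scripts.

import faiss
from datasets import load_from_disk

# Load the passages dataset saved with save_to_disk (columns: title, text, embeddings).
dataset = load_from_disk("SQUAD-KB/my_knowledge_dataset")

# Build an HNSW index over the DPR embeddings and save it next to the dataset.
index = faiss.IndexHNSWFlat(768, 128, faiss.METRIC_INNER_PRODUCT)
dataset.add_faiss_index("embeddings", custom_index=index)
dataset.save_faiss_index("embeddings", "SQUAD-KB/my_knowledge_dataset_hnsw_index.faiss")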
The task I am working on is:
- my own task or dataset: I’m working with different datasets, but for now I’ve only tested on the data made available by @shamanez here
To reproduce
Steps to reproduce the behavior:
- pip install torch, transformers, pytorch_lightning, ray[default]
- Download the data from here and put the SQUAD-KB and squad-training directories in the same directory as the script
- Change the args to finetune_rag.py
- Testing with --distributed_retriever pytorch, I get this error:
INFO:__main__:please use RAY as the distributed retrieval method
Traceback (most recent call last):
File "finetune_rag.py", line 793, in <module>
main(args)
File "finetune_rag.py", line 730, in main
model: GenerativeQAModule = GenerativeQAModule(args)
File "finetune_rag.py", line 138, in __init__
model = self.model_class.from_pretrained(hparams.model_name_or_path, config=config, retriever=retriever)
UnboundLocalError: local variable 'retriever' referenced before assignment
Stopped all 12 Ray processes.
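The log line above ("please use RAY as the distributed retrieval method") together with the UnboundLocalError suggests that the retriever variable is only assigned on the ray code path. A minimal, self-contained sketch of that failure pattern (simplified; this is not the actual finetune_rag.py code):

# Simplified reproduction of the control-flow shape behind the UnboundLocalError.
def build_retriever(distributed_retriever: str):
    if distributed_retriever == "ray":
        retriever = "ray retriever"  # placeholder for the real retriever object
    else:
        print("please use RAY as the distributed retrieval method")
        # no `retriever = ...` assignment happens on this path
    return retriever  # raises UnboundLocalError when the else branch was taken

build_retriever("pytorch")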
- Testing with --distributed_retriever ray, I get this error:
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
test_dir/kb-shards
2021-07-18 10:38:55,472 INFO worker.py:805 -- Connecting to existing Ray cluster at address: 135.181.63.142:6379
INFO:__main__:Getting named actors for NODE_RANK 0, LOCAL_RANK 7
Traceback (most recent call last):
File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 793, in <module>
main(args)
File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 725, in main
named_actors = [ray.get_actor("retrieval_worker_{}".format(i)) for i in range(args.num_retrieval_workers)]
File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 725, in <listcomp>
named_actors = [ray.get_actor("retrieval_worker_{}".format(i)) for i in range(args.num_retrieval_workers)]
File "/root/CS-Env/lib/python3.8/site-packages/ray-2.0.0.dev0-py3.8-linux-x86_64.egg/ray/_private/client_mode_hook.py", line 82, in wrapper
return func(*args, **kwargs)
File "/root/CS-Env/lib/python3.8/site-packages/ray-2.0.0.dev0-py3.8-linux-x86_64.egg/ray/worker.py", line 1746, in get_actor
return worker.core_worker.get_named_actor_handle(name)
File "python/ray/_raylet.pyx", line 1565, in ray._raylet.CoreWorker.get_named_actor_handle
File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'retrieval_worker_0'. You are either trying to look up a named actor you didn't create, the named actor died, or the actor hasn't been created because named actor creation is asynchronous.
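For context, ray.get_actor only finds actors that were previously created with a matching name and are still alive, so this ValueError means the retrieval_worker_* actors were never started on the cluster the script connected to, or died (or had not finished starting) before the lookup. A minimal sketch of Ray's named-actor API using the same naming pattern; the actor body is illustrative only:

import ray

ray.init()

@ray.remote
class RetrievalWorker:
    def ping(self):
        return "ok"

num_retrieval_workers = 4
workers = [
    RetrievalWorker.options(name=f"retrieval_worker_{i}").remote()
    for i in range(num_retrieval_workers)
]

# Named actor creation is asynchronous, so make sure the actors are up
# before looking them up by name; otherwise ray.get_actor can raise the
# same ValueError seen in the traceback above.
ray.get([w.ping.remote() for w in workers])
named_actors = [ray.get_actor(f"retrieval_worker_{i}") for i in range(num_retrieval_workers)]
print(ray.get(named_actors[0].ping.remote()))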
Expected behavior
The expected behavior is to be able to fine-tune RAG end-to-end without errors.
Top GitHub Comments
I believe part of the reason the code runs slowly is that the validation step is run after every training step. See this. @shamanez, is my understanding correct, and is it necessary for the validation step to be run after every training step? (The Trainer setting that usually controls this is sketched after this comment.)
P.S. Thank you for updating the implementation with the newer Ray and PL versions, it’s very helpful 😄
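If validation frequency really is the bottleneck, the knob that normally controls it in PyTorch Lightning is the Trainer's val_check_interval. A minimal illustration; the value 500 is arbitrary, not what the example script uses:

import pytorch_lightning as pl

# val_check_interval as an int runs validation every N training batches;
# as a float in (0, 1] it runs validation that fraction of the way through each epoch.
trainer = pl.Trainer(val_check_interval=500)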
Hi,
First, RAG-end2end cannot be trained with the pytorch retriever, since it is not enabled; you have to use the ray retriever (the pytorch retriever is very slow and makes it hard to update the indexed KB).
The second issue is related to your distributed setup; we have discussed it in this issue.
For a quick fix, change the line as follows (cast the variables to int):
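The line being referred to is not quoted above. Purely as a hypothetical illustration of the kind of cast meant, converting string-typed values (for example, GPU ids extracted from a --gpu_order string by a regex) to int could look like this:

import re

# Hypothetical illustration only; the actual finetune_rag.py line is not quoted above.
gpu_order = "[5,6,7,8,9,0,1,2,3,4]"                        # value passed via --gpu_order
gpu_ids = [int(x) for x in re.findall(r"\d+", gpu_order)]  # note the int(...) cast
print(gpu_ids)  # [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]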