
Unable to finetune RAG end2end due to error in finetune_rag.py file

See original GitHub issue

Environment info

  • transformers version: 4.8.2
  • Platform: Ubuntu 20.04
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.10.0.dev20210717+cu111
  • Tensorflow version (GPU?): none
  • Using GPU in script?: yes, 8 A6000s
  • Using distributed or parallel set-up in script?: yes, ray 2.0.0.dev0

Who can help

Quentin Lhoest (@lhoestq), Patrick von Platen (@patrickvonplaten)

Information

Model I am using (Bert, XLNet …): RAG

The problem arises when using:

  • the official example scripts: I used the finetune_rag_ray_end2end.sh script, but it doesn’t run in my environment with either ray or pytorch as the distributed retriever.
  • my own modified scripts: I only modified the args passed to finetune_rag.py:
    --data_dir  squad-training \
    --output_dir model_checkpoints \
    --model_name_or_path facebook/rag-token-base \
    --model_type rag_token \
    --fp16 \
    --gpus 8  \
    --profile \
    --do_train \
    --end2end \
    --do_predict \
    --n_val -1  \
    --train_batch_size 8 \
    --eval_batch_size 1 \
    --max_source_length 128 \
    --max_target_length 25 \
    --val_max_target_length 25 \
    --test_max_target_length 25 \
    --label_smoothing 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --weight_decay 0.001 \
    --adam_epsilon 1e-08 \
    --max_grad_norm 0.1 \
    --lr_scheduler polynomial \
    --learning_rate 3e-05 \
    --num_train_epochs 10 \
    --warmup_steps 500 \
    --gradient_accumulation_steps 8 \
    --distributed_retriever pytorch \
    --num_retrieval_workers 4  \
    --passages_path SQUAD-KB/my_knowledge_dataset \
    --index_path  SQUAD-KB/my_knowledge_dataset_hnsw_index.faiss \
    --index_name custom \
    --context_encoder_name facebook/dpr-ctx_encoder-multiset-base \
    --csv_path SQUAD-KB/squad-kb.csv \
    --index_gpus 1 \
    --gpu_order [5,6,7,8,9,0,1,2,3,4] \
    --shard_dir test_dir/kb-shards \
    --indexing_freq 500 
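
For context on the retrieval arguments above: --passages_path and --index_path point to a saved datasets dataset of passages and a FAISS HNSW index over its DPR context embeddings, a pair that the rag-end2end-retriever example prepares with a separate knowledge-dataset script. Below is a minimal sketch of that pattern; the "embeddings" column name and the 768-dimensional embedding size are assumptions for illustration, not values taken from this issue.

    # Minimal sketch (not the example's actual script) of how a passages dataset
    # plus HNSW index pair like SQUAD-KB/my_knowledge_dataset and
    # my_knowledge_dataset_hnsw_index.faiss is typically produced. The column
    # name "embeddings" and the 768-dim DPR embedding size are assumptions.
    import faiss
    from datasets import load_from_disk

    dataset = load_from_disk("SQUAD-KB/my_knowledge_dataset")  # expects title/text/embeddings columns

    # Index the precomputed DPR context embeddings with HNSW (inner product).
    index = faiss.IndexHNSWFlat(768, 128, faiss.METRIC_INNER_PRODUCT)
    dataset.add_faiss_index("embeddings", custom_index=index)

    # Save the index separately so finetune_rag.py can load it via --index_path.
    dataset.get_index("embeddings").save("SQUAD-KB/my_knowledge_dataset_hnsw_index.faiss")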

The task I am working on is:

  • my own task or dataset: I’m working with different datasets, but for now I’ve only tested on the data made available by @shamanez here

To reproduce

Steps to reproduce the behavior:

  1. Install the dependencies: pip install torch transformers pytorch_lightning "ray[default]"
  2. Download the data from here and put the SQUAD-KB and squad-training directories in the same directory of the script
  3. Change the args to finetune_rag.py
  4. Testing with --distributed_retriever pytorch, I get this error:
INFO:__main__:please use RAY as the distributed retrieval method
Traceback (most recent call last):
  File "finetune_rag.py", line 793, in <module>
    main(args)
  File "finetune_rag.py", line 730, in main
    model: GenerativeQAModule = GenerativeQAModule(args)
  File "finetune_rag.py", line 138, in __init__
    model = self.model_class.from_pretrained(hparams.model_name_or_path, config=config, retriever=retriever)
UnboundLocalError: local variable 'retriever' referenced before assignment
Stopped all 12 Ray processes.
  5. Testing with --distributed_retriever ray, I get this error:
Global seed set to 42
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
test_dir/kb-shards
2021-07-18 10:38:55,472	INFO worker.py:805 -- Connecting to existing Ray cluster at address: 135.181.63.142:6379
INFO:__main__:Getting named actors for NODE_RANK 0, LOCAL_RANK 7
Traceback (most recent call last):
  File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 793, in <module>
    main(args)
  File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 725, in main
    named_actors = [ray.get_actor("retrieval_worker_{}".format(i)) for i in range(args.num_retrieval_workers)]
  File "/root/Retr_Exp/transformers/examples/research_projects/rag-end2end-retriever/finetune_rag.py", line 725, in <listcomp>
    named_actors = [ray.get_actor("retrieval_worker_{}".format(i)) for i in range(args.num_retrieval_workers)]
  File "/root/CS-Env/lib/python3.8/site-packages/ray-2.0.0.dev0-py3.8-linux-x86_64.egg/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/root/CS-Env/lib/python3.8/site-packages/ray-2.0.0.dev0-py3.8-linux-x86_64.egg/ray/worker.py", line 1746, in get_actor
    return worker.core_worker.get_named_actor_handle(name)
  File "python/ray/_raylet.pyx", line 1565, in ray._raylet.CoreWorker.get_named_actor_handle
  File "python/ray/_raylet.pyx", line 158, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'retrieval_worker_0'. You are either trying to look up a named actor you didn't create, the named actor died, or the actor hasn't been created because named actor creation is asynchronous.
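
For what it's worth, both tracebacks appear to point at how the script wires up the retriever rather than at the data: with --distributed_retriever pytorch the script only logs “please use RAY as the distributed retrieval method” and never assigns the retriever variable (hence the UnboundLocalError), while with ray the driver looks up named actors that the retrieval workers must have registered first. A minimal sketch of Ray’s named-actor pattern that the ray path relies on; the actor class and method here are illustrative stand-ins, not the script’s actual ones.

    # Sketch of the named-actor pattern behind the second traceback. If the
    # actor is never created (e.g. the rank check guarding its creation never
    # fires, or the driver connects to a different Ray cluster), then
    # ray.get_actor() raises the "Failed to look up actor" ValueError above.
    import ray

    ray.init()  # the example instead connects to an existing Ray cluster by address

    @ray.remote
    class RetrievalWorker:  # illustrative stand-in for the script's retrieval actor
        def ping(self):
            return "ok"

    # Creating the actor registers it under a global name...
    worker = RetrievalWorker.options(name="retrieval_worker_0").remote()

    # ...and only then can other processes look it up by that name.
    handle = ray.get_actor("retrieval_worker_0")
    print(ray.get(handle.ping.remote()))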

Expected behavior

The expected behavior is being able to finetune RAG without errors.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments:13 (7 by maintainers)

Top GitHub Comments

1 reaction
aidansan commented, Sep 10, 2022

I believe part of the reason the code runs slowly is that the validation step is run after every training step (see this). @shamanez, is my understanding correct, and is it necessary for the validation step to be run after every training step?

P.S. Thank you for updating the implementation with the newer Ray and PL versions, it’s very helpful 😄
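
For reference, how often validation runs in PyTorch Lightning is a Trainer-level setting rather than something tied to each training step; a minimal sketch, with illustrative values that are not the example script’s actual configuration.

    # Minimal sketch: validation frequency is configured on the PL Trainer.
    # The values here are illustrative, not the rag-end2end example's defaults.
    import pytorch_lightning as pl

    trainer = pl.Trainer(
        max_epochs=10,
        val_check_interval=500,  # run validation every 500 training batches
        # check_val_every_n_epoch=1,  # alternative: validate once per epoch
    )
    # trainer.fit(model, datamodule=dm)  # model/datamodule omitted here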

1 reaction
shamanez commented, Jul 19, 2021

Hi,

  1. First, RAG-end2end cannot be trained with the pytorch retriever, since it is not enabled; you have to use the ray retriever (with the pytorch retriever it is very slow and hard to update the indexed KB).

  2. The second issue is related to your distributed system, and we have discussed it in this issue.

For a quick fix, change the line as follows (cast the variables to int):

    if ("LOCAL_RANK" not in os.environ or int(os.environ["LOCAL_RANK"]) == 0) and (
        "NODE_RANK" not in os.environ or int(os.environ["NODE_RANK"]) == 0
    ):
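
The cast matters because the rank variables arrive from the environment as strings, so comparing them directly to the integer 0 never matches and the guarded block never runs on the rank-0 process, which is consistent with the named-actor lookup failing above. A standalone illustration of just that comparison:

    # Environment variables are strings, so "0" == 0 is False; casting fixes it.
    import os

    os.environ["LOCAL_RANK"] = "0"
    print(os.environ["LOCAL_RANK"] == 0)       # False: str compared to int
    print(int(os.environ["LOCAL_RANK"]) == 0)  # True after the cast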
Read more comments on GitHub >
