Question on Prefix tuning code
Hi, I am looking at the prefix tuning code and have a few queries on the implementation.

- What exactly are the variables in these lines? I understand that prefix tuning provides an input to every layer of the encoder-decoder model, but my understanding is that there should be a single `wte` and a single `control_trans`; I am not sure what the variables in the highlighted lines do.
- Why the `*2` in this line of code?
- What does the `control_trans` variable mean in the code? What is its function?
- Also, I see another variable `mid_dim`. What is it conceptually?

Thank you
Yes, `wte` is for the cross-attention. We did not use “word initialization” for the prefix because they showed the benefits of such initialization under a low-data setting with only 100 samples. Actually, I am not sure whether, in Lisa’s paper, this “word initialization” was used jointly with the re-parameterization trick or only for the embedding-only ablation. If you have any ideas, please let me know!

Yes, we assume `num_encoder_layers == num_decoder_layers`. The permutation operation is used to make the tensor shape compatible with that of the key-value pairs.
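For concreteness, here is a minimal sketch of that reshape-and-permute step. This is not the repository's actual code; the dimension names and numbers below are placeholders chosen for illustration:

```python
import torch

# Hypothetical dimensions, for illustration only.
batch_size, prefix_len = 4, 10
n_layers, n_heads, head_dim = 6, 12, 64

# Flat output of the re-parameterization MLP: one key and one value vector
# per layer for every prefix position.
flat_prefix = torch.randn(batch_size, prefix_len, n_layers * 2 * n_heads * head_dim)

# Split the last dimension into (layer * 2, head, head_dim), then permute so
# that the layer axis comes first and the remaining axes match the
# (batch, head, seq, head_dim) layout of attention key/value tensors.
prefix = flat_prefix.view(batch_size, prefix_len, n_layers * 2, n_heads, head_dim)
prefix = prefix.permute(2, 0, 3, 1, 4)   # (n_layers * 2, batch, n_heads, prefix_len, head_dim)

# Group each layer's key and value together, giving one (2, ...) tensor per layer.
past_key_values = prefix.split(2)
print(len(past_key_values), past_key_values[0].shape)
# 6 torch.Size([2, 4, 12, 10, 64])
```

The per-layer `(key, value)` tensors produced this way can then be handed to the model as its cached key/value pairs.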
Hi, Thanks for your interest!

Our prefix-tuning code is a cleaned-up version of Lisa’s original implementation for BART. Answers to your questions are provided below, but we also recommend you look for more details in Lisa’s implementation and paper.

- `*2` means the dimension of the attention key plus the dimension of the attention value.
- `control_trans` is part of the re-parameterization trick introduced in Lisa’s paper.
- `mid_dim` is also part of the re-parameterization trick introduced in Lisa’s paper.
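For concreteness, here is a minimal sketch of how `wte`, `control_trans`, and `mid_dim` typically fit together in this re-parameterization. Again, this is not the repository's actual code; the hyper-parameter values are placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical hyper-parameters, for illustration only.
prefix_len, n_layers, n_embd, mid_dim = 10, 6, 768, 512

# wte: a trainable embedding, one vector per prefix position.
wte = nn.Embedding(prefix_len, n_embd)

# control_trans: the re-parameterization MLP. Rather than training the large
# (prefix_len x n_layers * 2 * n_embd) prefix directly, a smaller embedding is
# projected up through a bottleneck of width mid_dim.
control_trans = nn.Sequential(
    nn.Linear(n_embd, mid_dim),
    nn.Tanh(),
    # "* 2" because every layer needs both a key and a value vector.
    nn.Linear(mid_dim, n_layers * 2 * n_embd),
)

input_tokens = torch.arange(prefix_len).unsqueeze(0)   # (1, prefix_len)
prefix = control_trans(wte(input_tokens))              # (1, prefix_len, n_layers * 2 * n_embd)
print(prefix.shape)                                    # torch.Size([1, 10, 9216])
```

The MLP with hidden width `mid_dim` is reported in Lisa’s paper to stabilize optimization compared with training the prefix matrix directly, and it can be discarded after training, keeping only the final projected prefix.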