[Bug] Adapter and LoRA for Roberta
See original GitHub issue.

Running the setting below on SST2 and MNLI:

```
attn_mode="adapter"
attn_option="sequential"
attn_composition="add"
attn_bn=200  # attn bottleneck dim
ffn_mode="adapter"
ffn_option="sequential"
ffn_adapter_layernorm_option="none"
ffn_adapter_init_option="bert"
ffn_adapter_scalar="1"
ffn_bn=200  # ffn bottleneck dim
```
Several errors were raised. It seems some parameters in modeling_roberta.py were set incorrectly, e.g. d_model and dropout. I fixed them, but the log confuses me: Houlsby et al. added adapters in two places, after the self-attention and after the FFN. So why is an adapter added inside the self-attention, and what is adapter_layer_norm_before.weight used for?

Thanks
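For reference, a Houlsby-style bottleneck adapter is a small residual MLP attached to a sublayer output. The sketch below is only illustrative and not the repo's actual code; the class name, argument names, and defaults are assumptions. It shows how d_model, the bottleneck dimension (the attn_bn / ffn_bn = 200 above), dropout, the scaling factor (ffn_adapter_scalar), and an optional LayerNorm applied before the adapter (presumably what adapter_layer_norm_before refers to) could fit together:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative Houlsby-style bottleneck adapter (hypothetical names)."""

    def __init__(self, d_model: int, bottleneck_dim: int = 200,
                 dropout: float = 0.1, use_layernorm_before: bool = True,
                 scalar: float = 1.0):
        super().__init__()
        # Optional LayerNorm applied to the input before the bottleneck,
        # i.e. a "layernorm before the adapter" option.
        self.layernorm_before = nn.LayerNorm(d_model) if use_layernorm_before else nn.Identity()
        self.down_proj = nn.Linear(d_model, bottleneck_dim)  # d_model -> bottleneck
        self.up_proj = nn.Linear(bottleneck_dim, d_model)    # bottleneck -> d_model
        self.dropout = nn.Dropout(dropout)
        self.scalar = scalar  # fixed scaling of the adapter output

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.layernorm_before(hidden_states)
        x = self.up_proj(self.dropout(torch.relu(self.down_proj(x))))
        # Residual connection: the adapter only adds a small correction
        # on top of the frozen sublayer output.
        return residual + self.scalar * x
```

With attn_option / ffn_option set to "sequential", a module like this would be applied to each sublayer's output, i.e. the Houlsby placement described in the question.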
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
We never tried tuning the classifier head, but training it sounds like a reasonably cheap trick that should give higher performance. I am not sure about the reason; my guess is that it is easy to learn features that can be separated even by a random projection, given that both MNLI and SST2 are only two- or three-way classification tasks.
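As a hedged illustration of how cheap that is, one could unfreeze only the adapter modules and the classification head while everything else stays frozen. The name filters ("adapter", "classifier") are assumptions about how the modules are named in the patched RoBERTa, not the repo's exact API:

```python
from transformers import RobertaForSequenceClassification

# Assumed setup: a RoBERTa classifier whose encoder has been patched with adapters.
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

# Train only adapter parameters and the classification head; freeze the rest.
for name, param in model.named_parameters():
    param.requires_grad = ("adapter" in name) or ("classifier" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```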
The main advantage, or the most intriguing part, of parameter-efficient tuning is not reducing the training cost; it doesn't really reduce the training cost much, since it often takes longer to converge. In my opinion, the advantages are:

(1) Storage savings, as you mentioned.
(2) More interestingly, it has potential in multi-task settings: one small module is responsible for one task/domain while most of the parameters are shared. This separates model capacity in a modular way and may enable many applications, for example merging multiple adapters efficiently to create models that perform well on multiple domains, or continuously adding new capabilities to an existing system, without breaking its original capabilities, by just adding trained adapters (see the sketch after this list). These may not be achieved easily by traditional fine-tuning.
(3) Parameter-efficient tuning mitigates catastrophic forgetting by design, since the old parameters are frozen.
(4) Tuning a few parameters has also been shown in some papers to be more robust than full fine-tuning and superior for few-shot learning.
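A conceptual sketch of point (2), assuming adapter parameters can be identified by name: the frozen backbone is shared, and each task only contributes a small adapter state dict that can be saved and swapped at inference time. The helper names and the "adapter" filter are hypothetical, not part of the repo:

```python
import torch

def extract_adapter_state(model: torch.nn.Module) -> dict:
    """Collect only adapter weights (a few MB) instead of the full checkpoint."""
    return {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k}

def load_adapter_state(model: torch.nn.Module, adapter_state: dict) -> None:
    """Overwrite just the adapter weights; the shared frozen backbone is untouched."""
    model.load_state_dict(adapter_state, strict=False)

# Example usage (hypothetical file names):
#   torch.save(extract_adapter_state(model), "sst2_adapter.pt")
#   load_adapter_state(model, torch.load("mnli_adapter.pt"))
```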
Thanks for your reply!