Question regarding training of BartForConditionalGeneration
Hello Guys,
I am trying to fine-tune the BART summarization model, but due to the lack of a big dataset, I am having some difficulties with the fine-tuning.
Thus, I decided to look at the training process of the BartForConditionalGeneration model in detail. I came across the article Introducing BART from one of the engineers at HuggingFace, @sshleifer. It says that BartModel was directly fine-tuned for the summarization task without any new randomly initialized heads.
My question is about this fine-tuning process, especially on the CNN-DailyMail dataset. Do you guys fine-tune the entire BART model, only the decoder, or something else?
I looked at the example fine-tuning script provided on GitHub, but I didn't find anything related to freezing some part of the model.
I also tried to look at the source code of the BartForConditionalGeneration model and observed the following: it just adds a linear layer on top of BartModel (copy-pasting the __init__ code here for quick reference).
self.model = BartModel(config)
self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))
self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=False)
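For a quick sanity check, these pieces can be inspected directly on a loaded checkpoint (a minimal sketch, not part of the library source quoted above; shapes assume the facebook/bart-large configuration):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
print(type(model.model).__name__)     # BartModel: the encoder-decoder backbone
print(model.lm_head)                  # Linear projection from d_model to the vocabulary size, no bias
print(model.final_logits_bias.shape)  # (1, vocab_size) buffer added to the output logits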
At first, I thought these are the new parameters being introduced and thus being trained. Therefore, I tried the following code to check the number of trainable parameters while keeping the encoder and decoder frozen:
from transformers import BartModel, BartForConditionalGeneration, BartTokenizer

def freeze_params(model):
    for par in model.parameters():
        par.requires_grad = False

model_sum = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
freeze_params(model_sum.get_encoder())  ## freeze the encoder
freeze_params(model_sum.get_decoder())  ## freeze the decoder
model_sum.train()  ## set the train mode
train_p = [p for p in model_sum.parameters() if p.requires_grad]  ## get the trainable params
print(f'Length of train params in Summarization Model : {len(train_p)}')
But this code shows that the list is empty. One thing I can do is to explicitly set requires_grad=True for the parameters in model_sum.lm_head and only fine-tune those parameters, as sketched below.
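That workaround would look roughly like this (a minimal sketch of the idea just described; the printed count may depend on whether lm_head shares its weight with the embeddings):

for par in model_sum.lm_head.parameters():
    par.requires_grad = True

train_p = [p for p in model_sum.parameters() if p.requires_grad]
print(f'Trainable params after unfreezing lm_head : {len(train_p)}')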
Still, I am curious to understand the original training/fine-tuning process. It would be of great help to me if you guys could answer my question.
P.S. - Love the HuggingFace library.
Thanks,
Naman
Top GitHub Comments
Hey @JessicaLopezEspejel,
model.get_encoder().layers will give you a list (torch.nn.modules.container.ModuleList, to be precise) of the layers in the encoder, and you can freeze the required layers using the freeze_params function provided in the utils.py file. I have included a small code snippet for your reference. Hope this helps!
Thanks, Naman
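For reference, freezing only some of the encoder layers might look roughly like this (a sketch based on the description above, not the original snippet from the comment; the choice of the first 6 layers is only an example):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

def freeze_params(model):
    for par in model.parameters():
        par.requires_grad = False

# freeze, say, the first 6 encoder layers and leave the rest trainable
for layer in model.get_encoder().layers[:6]:
    freeze_params(layer)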