
Question regarding training of BartForConditionalGeneration

See original GitHub issue

Hello Guys,

I am trying to fine-tune the BART summarization model, but due to the lack of a big dataset, I am having some difficulties with the fine-tuning.

Thus, I decided to look at the training process of the BartForConditionalGeneration model in detail. I came across the article Introducing BART from one of the engineers at HuggingFace, @sshleifer. It says that BartModel was directly fine-tuned for the summarization task without any new randomly initialized heads.

My question is about this fine-tuning process, especially on the CNN-DailyMail dataset. Do you fine-tune the entire BART model, only the decoder, or something else?

I looked at the example fine-tuning script provided on GitHub, but I didn’t find anything related to freezing parts of the model.

I also looked at the source code of the BartForConditionalGeneration model and observed the following:

It just adds a linear layer on top of BartModel (copy-pasting the relevant __init__ code here for quick reference):

self.model = BartModel(config)
self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))
self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=False)
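One detail worth noting here (my reading of the default config, so treat it as an assumption): with tie_word_embeddings=True, which is the default for facebook/bart-large, this lm_head weight is tied to the shared input embedding rather than being a fresh, randomly initialized parameter. A quick way to check:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

# With tied word embeddings (the default), the LM head reuses the shared
# embedding matrix instead of introducing new trainable weights.
print(model.lm_head.weight is model.model.shared.weight)                # expected: True
print(model.lm_head.weight is model.model.encoder.embed_tokens.weight)  # expected: True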

At first, I thought these were the new parameters being introduced and thus being trained. Therefore, I tried the following code to check the number of trainable parameters while keeping the encoder and decoder fixed:

from transformers import BartModel, BartForConditionalGeneration, BartTokenizer

def freeze_params(model):
    for par in model.parameters():
        par.requires_grad = False

model_sum = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
freeze_params(model_sum.get_encoder()) ## freeze the encoder
freeze_params(model_sum.get_decoder()) ## freeze the decoder 

model_sum.train() ## set the train mode
train_p = [p for p in model_sum.parameters() if p.requires_grad] ## get the trainable params
print(f'Length of train params in Summarization Model : {len(train_p)}')

But this code shows that the list is empty. One thing I can do is explicitly set requires_grad=True for the parameters in model_sum.lm_head and fine-tune only those parameters. But I am curious to understand the original training/fine-tuning process. It would be of great help if you could answer my question.
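(For reference, that workaround would look roughly like the sketch below. Note that lm_head.weight is tied to the shared embedding by default, so un-freezing the head effectively un-freezes the shared token embedding as well, and final_logits_bias is a registered buffer, so it never appears in parameters() in the first place.)

# Rough sketch of the workaround described above: freeze everything, then
# re-enable gradients only for the LM head.  Because lm_head.weight is tied
# to model_sum.model.shared.weight, this also un-freezes the shared token
# embedding (they are the same tensor).
for par in model_sum.parameters():
    par.requires_grad = False

for par in model_sum.lm_head.parameters():
    par.requires_grad = True

train_p = [p for p in model_sum.parameters() if p.requires_grad]
print(f'Trainable params after un-freezing lm_head : {len(train_p)}')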

P.S. - Love the HuggingFace library.

Thanks,
Naman

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

1 reaction
bnaman50 commented, Apr 27, 2022

Hey @JessicaLopezEspejel,

model.get_encoder().layers will give you a list (torch.nn.modules.container.ModuleList, to be precise) of the layers in the encoder, and you can freeze the required layers using the freeze_params function provided in the utils.py file. I have included a small code snippet for your reference. Hope this helps!

from torch import nn
from transformers import AutoTokenizer, AutoModel

def freeze_params(model: nn.Module):
    """Set requires_grad=False for each of model.parameters()"""
    for par in model.parameters():
        par.requires_grad = False

model = AutoModel.from_pretrained("facebook/bart-large")
enc_layers = model.get_encoder().layers
freeze_params(enc_layers[0])  # freeze layer 0
dropout = enc_layers[0].dropout   # return dropout value for layer 0
enc_layers[0].dropout = 0.5  # set dropout value for layer 0
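Building on the snippet above, freezing a whole block of layers is just a loop over that ModuleList; the choice of the first 6 layers here is arbitrary and only meant as an illustration:

# Freeze the first 6 of bart-large's 12 encoder layers (arbitrary choice,
# just to show how to iterate over the ModuleList).
for layer in enc_layers[:6]:
    freeze_params(layer)

# Count how many parameter tensors are now frozen.
frozen = sum(1 for p in model.parameters() if not p.requires_grad)
print(f"Frozen parameter tensors: {frozen}")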

Thanks, Naman

1 reaction
sshleifer commented, Mar 2, 2021
  • I would try concatenating.
  • I would also grid search evaluation parameters (min_length, max_length, num_beams, length_penalty); see the sketch below.
  • I would evaluate a few distilbart/distill-pegasus variants before any fine-tuning to decide which to start from.
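A rough sketch of what such a grid search over generation parameters could look like (the checkpoint, value ranges, and scoring step are placeholders for illustration, not recommendations from the thread):

from itertools import product

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever BART/distilbart/pegasus variant
# you are actually evaluating.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model.eval()

article = "Some validation article text goes here."
inputs = tokenizer(article, return_tensors="pt", truncation=True)

# Illustrative value ranges; tune these on your own validation set.
grid = {
    "min_length": [30, 56],
    "max_length": [128, 142],
    "num_beams": [4, 6],
    "length_penalty": [1.0, 2.0],
}

for values in product(*grid.values()):
    gen_kwargs = dict(zip(grid.keys(), values))
    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Score `summary` against the reference (e.g. with ROUGE) and keep the
    # best-performing gen_kwargs.
    print(gen_kwargs)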
