PreTrainedModel's tie_weights invocation needs to be configurable
`PreTrainedModel` defines a `tie_weights` method and then in one place suggests:

> Takes care of tying weights embeddings afterwards if the model class has a `tie_weights()` method.
But since the super-class has it defined, it’s always there.
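For reference, the base-class method does roughly the following (a paraphrase of `modeling_utils.py`, not verbatim; exact details vary by version):

```python
# Rough paraphrase of the default PreTrainedModel.tie_weights behavior.
def tie_weights(self):
    output_embeddings = self.get_output_embeddings()
    if output_embeddings is not None:
        # Unconditionally ties (or clones, for TorchScript) the output
        # embeddings to the input embeddings.
        self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
```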
So the only way for a sub-class to avoid this “tying” is to override it with:
```python
def tie_weights(self):
    pass
```
If nothing else happens, that comment needs to be edited to suggest a no-op override in the sub-class.
But it took some hunting to get there, so a better solution is needed.
Most likely most (all?) of the current encoder/decoder models in transformers share token embedding weights, which is why the issue hasn't come up. I'm working on porting a fairseq transformer, and there the encoder/decoder token embeddings aren't shared.
I propose a solution that adds a new param to `PretrainedConfig`, say `is_enc_dec_sharing_embeds=True`, lets the subclass override it, and then adds this at the start of `tie_weights` in `modeling_utils.py`:
```python
def tie_weights(self):
    if not self.config.is_enc_dec_sharing_embeds:
        return
```
That way it's easy to quickly become aware that an action needs to be taken, and to set the desired behavior from within the subclass.
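For instance, a ported model whose encoder and decoder don't share token embeddings could then opt out from its config. A minimal sketch, assuming the proposed param name and a hypothetical `MyTransformerConfig`:

```python
from transformers import PretrainedConfig

class MyTransformerConfig(PretrainedConfig):
    """Hypothetical config for a model whose enc/dec token embeddings aren't shared."""

    def __init__(self, is_enc_dec_sharing_embeds=False, **kwargs):
        super().__init__(**kwargs)
        # Turning this off makes the base tie_weights() return early,
        # so no embedding tying is performed for this model.
        self.is_enc_dec_sharing_embeds = is_enc_dec_sharing_embeds
```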
Thoughts?
If the proposed solution is agreeable, please let me know which config param name it should be (`is_enc_dec_sharing_embeds` or something different) and I will submit a PR.
Thank you.
edit:
OK, having had a closer look:

```bash
grep -r -A 2 'def tie_weights' src/transformers | grep pass | wc -l
```

we have 5 sub-classes that override it with a no-op, so only some rely on the default. Bad superclass, no cookies for you.
Awesome, I will open a PR. I actually need this feature for the `EncoderDecoderModel` as well.

This sounds like a good idea. I would advocate for a `tie_word_embeddings` parameter in the configuration as @patrickvonplaten suggested, but I would keep `tie_weights` as the method that does the weight tying rather than renaming that method as well. Just a quick glance at the configuration tells you which weights it's about to tie, and it will be able to handle other cases of weight tying that we might encounter in the future without the need of adding new methods.
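A minimal sketch of what that could look like in `modeling_utils.py`, assuming a `tie_word_embeddings` config attribute that defaults to `True` (the actual implementation details may differ):

```python
# Sketch of the base-class method guarded by the proposed config flag.
def tie_weights(self):
    # Only tie input/output word embeddings when the config asks for it;
    # models with separate embeddings set tie_word_embeddings=False in their config.
    if getattr(self.config, "tie_word_embeddings", True):
        output_embeddings = self.get_output_embeddings()
        if output_embeddings is not None:
            self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
```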