Layer Norm in XLM-R XL and XXL
Hi 😃
I’m currently trying to convert the recently released XLM-R XL and XXL models into Transformers-compatible weights.
I’m using the latest fairseq master version (commit 2fd9d8a972794ba919174baf0d1828a5a4c626f3), and there’s something strange with the layer norm parameters.
For debugging, here are the parameter names (shortened) for the XLM-R Base model:
```
encoder.sentence_encoder.layernorm_embedding.weight
encoder.sentence_encoder.layernorm_embedding.bias
```
So the parameter name is `layernorm_embedding`. However, for the new XL models, it outputs:
```
encoder.sentence_encoder.layer_norm.weight
encoder.sentence_encoder.layer_norm.bias
```
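For reference, these names can be dumped directly from the checkpoint file with plain torch (a minimal sketch; the checkpoint path is a placeholder for your local copy):

```python
import torch

# Placeholder path; point this at the downloaded fairseq checkpoint.
state = torch.load("model.pt", map_location="cpu")

# fairseq checkpoints keep the weights under the "model" key.
for name in state["model"]:
    if "layernorm" in name or "layer_norm" in name:
        print(name)
```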
So the parameter name is `layer_norm`. When loading the model with the fairseq library, like:
```python
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

xlmr = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
xlmr.eval()  # disable dropout
```
the (shortened) module list for XLM-R Base shows:
```
RobertaHubInterface(
  (model): RobertaModel(
    (encoder): RobertaEncoder(
      (sentence_encoder): TransformerEncoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250002, 768, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)
        (layernorm_embedding): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
```
whereas the module list for the XL model shows:
```
RobertaHubInterface(
  (model): RobertaModel(
    (encoder): RobertaEncoder(
      (sentence_encoder): TransformerEncoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250880, 2560, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 2560, padding_idx=1)
```
So the `layernorm_embedding` module is missing in the XL model 🤔
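A conversion script therefore has to branch on which layer norm is present, roughly like this (a sketch based on the module listings above; `xlmr` is the loaded fairseq model from the snippet earlier):

```python
# Pick the right layer norm depending on the model variant.
sent_enc = xlmr.model.encoder.sentence_encoder

if getattr(sent_enc, "layernorm_embedding", None) is not None:
    # XLM-R Base/Large: layer norm applied right after the embeddings
    layer_norm = sent_enc.layernorm_embedding
else:
    # XLM-R XL/XXL: a single encoder-level `layer_norm` instead
    layer_norm = sent_enc.layer_norm
```

Note that the two norms sit at different points in the forward pass, so simply renaming the weights would presumably not be enough.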
Side note: I’ve updated the conversion script in the Transformers library to be compatible with the latest fairseq master. At the end, the script compares a model (forward) pass between the original fairseq model and the converted model to check for differences. For the old XLM-R Base model the outputs are identical, whereas for XLM-R XL the difference is very high. The script can be found here.
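The comparison at the end of the script boils down to something like this (a sketch; `hf_model` stands for the converted Transformers model, and the input ids are arbitrary):

```python
import torch

input_ids = torch.tensor([[0, 31414, 232, 2]])  # arbitrary sample ids

with torch.no_grad():
    their_output = xlmr.model(input_ids)[0]  # original fairseq forward pass
    our_output = hf_model(input_ids)[0]      # converted Transformers forward pass

max_abs_diff = torch.max(torch.abs(our_output - their_output)).item()
print(f"max absolute difference: {max_abs_diff}")
print("allclose:", torch.allclose(our_output, their_output, atol=1e-3))
```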
Thanks for your help!
Top GitHub Comments
@ngoyal2707 Thanks for your explanation 👍 I can see the changes in commit 54423d3b22a3e7f536e02e9e5445cef9becbd60d, so we’re currently adjusting the RoBERTa model in Transformers to support the new models 😃
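For anyone landing here later: once that support is merged, loading should presumably look something like this (the class and checkpoint names below are my assumptions, not confirmed yet):

```python
from transformers import AutoTokenizer, XLMRobertaXLModel

# Assumed names once the converted weights are published.
tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = XLMRobertaXLModel.from_pretrained("facebook/xlm-roberta-xl")

inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 2560)
```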
@ricardorei I installed fairseq via `pip3 install git+https://github.com/pytorch/fairseq.git`, as I’ve also seen different error messages for various fairseq versions. But with the latest master I could load the new larger models 🤗