
Layer Norm in XLM-R XL and XXL


Hi 😃

I’m currently trying to convert the recently released XLM-R XL and XXL models into Transformers-compatible weights.

I’m using the latest fairseq master (commit 2fd9d8a972794ba919174baf0d1828a5a4c626f3), and there’s something strange with the layer norm parameters.

For debugging, here are the parameter names (shortened) for the XLM-R Base model:

encoder.sentence_encoder.layernorm_embedding.weight        
encoder.sentence_encoder.layernorm_embedding.bias

Here the parameter name is layernorm_embedding. However, for the new XL models, the output is:

encoder.sentence_encoder.layer_norm.weight
encoder.sentence_encoder.layer_norm.bias

So the parameter name is “layer_norm”. When loading the model using the fairseq library, like this:

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# roberta_checkpoint_path points to the downloaded fairseq checkpoint directory
xlmr = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
xlmr.eval()  # disable dropout
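
To see this directly, one quick check (a minimal sketch, assuming xlmr was loaded as above) is to list the layer-norm-related parameter names in the loaded state dict:

# Print every layer-norm-related parameter name of the loaded fairseq model.
# For XLM-R Base this includes encoder.sentence_encoder.layernorm_embedding.*,
# for the XL/XXL checkpoints it includes encoder.sentence_encoder.layer_norm.* instead
# (the per-layer self_attn_layer_norm / final_layer_norm entries appear in both).
for name in xlmr.model.state_dict():
    if "layer_norm" in name or "layernorm" in name:
        print(name)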

The (shortened) module list for XLM-R Base shows:

RobertaHubInterface(                                                                                 
  (model): RobertaModel(                                                                                  
    (encoder): RobertaEncoder(                                                
      (sentence_encoder): TransformerEncoder(                               
        (dropout_module): FairseqDropout()                                                               
        (embed_tokens): Embedding(250002, 768, padding_idx=1)               
        (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)                           
        (layernorm_embedding): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)

whereas the module list for the XL model shows:

RobertaHubInterface(                                                                                      
  (model): RobertaModel(                                                                              
    (encoder): RobertaEncoder(                                                                            
      (sentence_encoder): TransformerEncoder(                                                             
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250880, 2560, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 2560, padding_idx=1)

So the embedding layer norm (layernorm_embedding) is missing in the XL model 🤔
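
Since the two checkpoint families expose different top-level layer norms, any conversion code presumably has to branch on which variant it is dealing with. A minimal, hypothetical sketch of such a check (the key names are taken from the listings above; everything else is an assumption):

def detect_top_level_layer_norm(state_dict):
    # Older XLM-R checkpoints (Base/Large) carry an embedding layer norm,
    # while the new XL/XXL checkpoints expose a top-level layer_norm instead.
    keys = set(state_dict.keys())
    if "encoder.sentence_encoder.layernorm_embedding.weight" in keys:
        return "layernorm_embedding"  # Base/Large style
    if "encoder.sentence_encoder.layer_norm.weight" in keys:
        return "layer_norm"  # XL/XXL style
    raise ValueError("no top-level layer norm found in checkpoint")

print(detect_top_level_layer_norm(xlmr.model.state_dict()))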

Side note: I’ve updated the conversion script in the Transformers library to be compatible with the latest fairseq master. At the end, the script compares a model (forward) pass between the original fairseq model and the converted model to check for differences. For the old XLM-R Base model the output is identical, whereas for XLM-R XL the difference is very high. The script can be found here.
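
For reference, the comparison at the end of the script boils down to something like the following sketch (hf_model stands in here for the converted Transformers model; the real script builds it from the converted weights and uses the proper tokenizer):

import torch

# Feed the same token IDs through both models and compare the outputs.
input_ids = xlmr.encode("Hello world!").unsqueeze(0)  # 1 x seq_len LongTensor

with torch.no_grad():
    their_output = xlmr.model(input_ids)[0]  # fairseq forward returns (logits, extra)
    our_output = hf_model(input_ids)[0]      # Transformers forward returns logits first

max_abs_diff = (their_output - our_output).abs().max().item()
print(f"max absolute difference: {max_abs_diff}")
print("outputs match:", torch.allclose(their_output, our_output, atol=1e-3))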

Thanks for your help!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

3 reactions
stefan-it commented, Jun 9, 2021

@ngoyal2707 Thanks for your explanation 👍 I could see the changes in commit 54423d3b22a3e7f536e02e9e5445cef9becbd60d, so we’re currently adjusting the RoBERTa model in Transformers to support the new models 😃

1 reaction
stefan-it commented, Jun 9, 2021

@ricardorei I installed fairseq via pip3 install git+https://github.com/pytorch/fairseq.git, as I’ve also seen different error messages for various fairseq versions. But with the latest master I could load the new larger models 🤗
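
If it helps to double-check which build is actually being imported, here is a small sketch (when fairseq is installed from git, the reported version string usually carries a pre-release or local suffix, though the exact format may vary):

import fairseq

# Print the version of the fairseq package that Python actually picks up.
print(fairseq.__version__)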
