
Layer Norm in XLM-R XL and XXL


Hi 😃

I’m currently trying to convert the recently released XLM-R XL and XXL models into Transformers-compatible weights.

I’m using the latest fairseq master (commit 2fd9d8a972794ba919174baf0d1828a5a4c626f3), and there’s something strange with the layer norm parameters.

For debugging, here are the parameter names (shortened) for the XLM-R Base model:

encoder.sentence_encoder.layernorm_embedding.weight        
encoder.sentence_encoder.layernorm_embedding.bias

Here the parameter name is layernorm_embedding. However, for the new XL models, the output is:

encoder.sentence_encoder.layer_norm.weight
encoder.sentence_encoder.layer_norm.bias

So the parameter name is “layer_norm”. When loading the model using the fairseq library, like this:

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# roberta_checkpoint_path points to the downloaded fairseq checkpoint directory
xlmr = FairseqRobertaModel.from_pretrained(roberta_checkpoint_path)
xlmr.eval()  # disable dropout
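
To see this directly, one quick check (a minimal sketch, assuming xlmr was loaded as above) is to list the layer-norm-related parameter names in the loaded state dict:

# Print every layer-norm-related parameter name of the loaded fairseq model.
# For XLM-R Base this includes encoder.sentence_encoder.layernorm_embedding.*,
# for the XL/XXL checkpoints it includes encoder.sentence_encoder.layer_norm.* instead
# (the per-layer self_attn_layer_norm / final_layer_norm entries appear in both).
for name in xlmr.model.state_dict():
    if "layer_norm" in name or "layernorm" in name:
        print(name)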

The (shortened) module list for XLM-R Base shows:

RobertaHubInterface(                                                                                 
  (model): RobertaModel(                                                                                  
    (encoder): RobertaEncoder(                                                
      (sentence_encoder): TransformerEncoder(                               
        (dropout_module): FairseqDropout()                                                               
        (embed_tokens): Embedding(250002, 768, padding_idx=1)               
        (embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)                           
        (layernorm_embedding): FusedLayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)

whereas the module list for the XL model shows:

RobertaHubInterface(                                                                                      
  (model): RobertaModel(                                                                              
    (encoder): RobertaEncoder(                                                                            
      (sentence_encoder): TransformerEncoder(                                                             
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(250880, 2560, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(514, 2560, padding_idx=1)

So the embedding layer norm (layernorm_embedding) is missing in the XL model 🤔
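
Since the two checkpoint families expose different top-level layer norms, any conversion code presumably has to branch on which variant it is dealing with. A minimal, hypothetical sketch of such a check (the key names are taken from the listings above; everything else is an assumption):

def detect_top_level_layer_norm(state_dict):
    # Older XLM-R checkpoints (Base/Large) carry an embedding layer norm,
    # while the new XL/XXL checkpoints expose a top-level layer_norm instead.
    keys = set(state_dict.keys())
    if "encoder.sentence_encoder.layernorm_embedding.weight" in keys:
        return "layernorm_embedding"  # Base/Large style
    if "encoder.sentence_encoder.layer_norm.weight" in keys:
        return "layer_norm"  # XL/XXL style
    raise ValueError("no top-level layer norm found in checkpoint")

print(detect_top_level_layer_norm(xlmr.model.state_dict()))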

Side note: I’ve updated the conversion script in the Transformers library to be compatible with the latest fairseq master. At the end, the script compares a model (forward) pass between the original fairseq model and the converted model to check for differences. For the old XLM-R Base model the output is identical, whereas for XLM-R XL the difference is very high. The script can be found here.
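
For reference, the comparison at the end of the script boils down to something like the following sketch (hf_model stands in here for the converted Transformers model; the real script builds it from the converted weights and uses the proper tokenizer):

import torch

# Feed the same token IDs through both models and compare the outputs.
input_ids = xlmr.encode("Hello world!").unsqueeze(0)  # 1 x seq_len LongTensor

with torch.no_grad():
    their_output = xlmr.model(input_ids)[0]  # fairseq forward returns (logits, extra)
    our_output = hf_model(input_ids)[0]      # Transformers forward returns logits first

max_abs_diff = (their_output - our_output).abs().max().item()
print(f"max absolute difference: {max_abs_diff}")
print("outputs match:", torch.allclose(their_output, our_output, atol=1e-3))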

Thanks for your help!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

3 reactions
stefan-it commented, Jun 9, 2021

@ngoyal2707 Thanks for your explanation 👍 I could see the changes in commit 54423d3b22a3e7f536e02e9e5445cef9becbd60d, so we’re currently adjusting the RoBERTa model in Transformers to support the new models 😃

1 reaction
stefan-it commented, Jun 9, 2021

@ricardorei I installed fairseq via pip3 install git+https://github.com/pytorch/fairseq.git, as I’ve also seen different error messages for various fairseq versions. But with the latest master I could load the new larger models 🤗
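
If it helps to double-check which build is actually being imported, here is a small sketch (when fairseq is installed from git, the reported version string usually carries a pre-release or local suffix, though the exact format may vary):

import fairseq

# Print the version of the fairseq package that Python actually picks up.
print(fairseq.__version__)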
