Fine-tuning XLM Longformer gets a lot of NaN values

See original GitHub issue

I am trying to follow your code for building a custom Longformer out of XLM models (typically XLM-RoBERTa); however, I get NaN values as soon as I start training the model on a downstream classification task. Hence, I am confused about where I am going wrong here.

I am using PyTorch 1.6 and Transformers 3.1.

Here is the basic structure of my code; note that the only change I am making is swapping RoBERTa for XLM-RoBERTa:

import logging

from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
# Transformers 3.x module path; in 4.x this moved to transformers.models.longformer.modeling_longformer
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)


class RobertaLongSelfAttention(LongformerSelfAttention):
    '''
    Wrapper around Longformer's self attention
    '''
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(
            hidden_states, 
            attention_mask=attention_mask, 
            output_attentions=output_attentions
        )


class RobertaLongForSequenceClassification(XLMRobertaForSequenceClassification):
    '''
    Change all layers to have longer attention
    '''
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base',
                                                  gradient_checkpointing=True)
    # gradient_checkpointing is a model/config option, not a tokenizer one
    tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base',
                                                    model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

And here is how I load the model once I have saved it to a path. Currently I am using a position length of 1024:

logger.info(f'Converting xlm-roberta-base into xlm-roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model_roberta = RobertaLongForSequenceClassification.from_pretrained(model_path)
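
For what it is worth, a quick way to see whether the converted checkpoint is already broken is to run a single long forward pass before training and check the outputs and the position embeddings for NaNs or extreme values. This is a minimal sketch, not part of the original issue: sample_text is just a placeholder long input, the input is padded to the full 1024 positions so the sequence length is a multiple of the attention window, and the tuple indexing assumes the default (non-dict) outputs of Transformers 3.1.

import torch

# placeholder long input, only meant to exceed the usual 512-token limit
sample_text = ' '.join(['hello'] * 900)
inputs = tokenizer(sample_text, return_tensors='pt',
                   padding='max_length', truncation=True, max_length=1024)

model_roberta.eval()
with torch.no_grad():
    outputs = model_roberta(**inputs)

logits = outputs[0]  # Transformers 3.1 returns a tuple; element 0 is the logits
print('NaN in logits:', torch.isnan(logits).any().item())

# the extended position embeddings are where the problem originates (see the comments below)
pos_emb = model_roberta.roberta.embeddings.position_embeddings.weight
print('NaN in position embeddings:', torch.isnan(pos_emb).any().item())
print('max |position embedding|:', pos_emb.abs().max().item())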

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 7

Top GitHub Comments

4 reactions
shanybarhom commented, Oct 21, 2020

@pranav-ust @ibeltagy I found a solution that works for me. Replace:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

with:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

The idea behind this is that, for some reason, the first line works fine for RoBERTa (it creates a tensor of zeros), but for XLM-R it creates a tensor containing "trashy" values (very large numbers). Since the first two position embeddings are never initialized, this causes NaN values in the padding vector.
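
In context, the change sits inside create_long_model where the larger matrix is allocated. The extra line that copies the two reserved positions is an optional addition, not part of the comment above:

# allocate the larger position-embedding matrix with zeros instead of
# uninitialized memory, so positions 0 and 1 never hold garbage values
new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

# optional (not in the comment above): also copy the two reserved RoBERTa
# positions from the pretrained weights instead of leaving them at zero
new_pos_embed[:2] = model.roberta.embeddings.position_embeddings.weight[:2]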

0 reactions
MarkusSagen commented, Nov 24, 2020


Nice! I solved this by lowering the gradient clipping threshold from 5.0 to 1.0. I was also getting NaN values when training a full XLM-R model otherwise.
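
For reference, when training with the Hugging Face Trainer the clipping threshold is the max_grad_norm argument of TrainingArguments; a hand-written PyTorch loop would call clip_grad_norm_ instead. The surrounding training setup is assumed here, since the comment does not show it:

from transformers import TrainingArguments

# clip gradients at 1.0 instead of 5.0; output_dir is a hypothetical path
training_args = TrainingArguments(
    output_dir='./xlm-roberta-long-cls',
    max_grad_norm=1.0,
)

# equivalent step in a manual PyTorch training loop:
# torch.nn.utils.clip_grad_norm_(model_roberta.parameters(), max_norm=1.0)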
