Fine-tuning XLM Longformer gets a lot of NaN values

See original GitHub issue

I am trying to follow your code for building a custom Longformer out of XLM models (typically XLM-RoBERTa); however, I get NaN values as soon as I start training the model on a downstream classification task. Hence, I am confused about where I am going wrong here.

I am using PyTorch 1.6 and Transformers 3.1.

Here is the basic structure of my code; note that the only change I am making is swapping RoBERTa for XLM-RoBERTa:

import logging

from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
# Transformers 3.x module path; in 4.x this moved to transformers.models.longformer.modeling_longformer
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)


class RobertaLongSelfAttention(LongformerSelfAttention):
    '''
    Wrapper around Longformer's self attention
    '''
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(
            hidden_states, 
            attention_mask=attention_mask, 
            output_attentions=output_attentions
        )


class RobertaLongForSequenceClassification(XLMRobertaForSequenceClassification):
    '''
    Change all layers to have longer attention
    '''
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base',
                                                  gradient_checkpointing=True)
    # gradient_checkpointing is a model/config option, not a tokenizer one
    tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base',
                                                    model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer

And here is how I load the model once I have saved it to a path. Currently I am using a position length of 1024:

logger.info(f'Converting xlm-roberta-base into xlm-roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path, attention_window=model_args.attention_window, max_pos=model_args.max_pos)

tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model_roberta = RobertaLongForSequenceClassification.from_pretrained(model_path)
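
For what it is worth, a quick way to see whether the converted checkpoint is already broken is to run a single long forward pass before training and check the outputs and the position embeddings for NaNs or extreme values. This is a minimal sketch, not part of the original issue: sample_text is just a placeholder long input, the input is padded to the full 1024 positions so the sequence length is a multiple of the attention window, and the tuple indexing assumes the default (non-dict) outputs of Transformers 3.1.

import torch

# placeholder long input, only meant to exceed the usual 512-token limit
sample_text = ' '.join(['hello'] * 900)
inputs = tokenizer(sample_text, return_tensors='pt',
                   padding='max_length', truncation=True, max_length=1024)

model_roberta.eval()
with torch.no_grad():
    outputs = model_roberta(**inputs)

logits = outputs[0]  # Transformers 3.1 returns a tuple; element 0 is the logits
print('NaN in logits:', torch.isnan(logits).any().item())

# the extended position embeddings are where the problem originates (see the comments below)
pos_emb = model_roberta.roberta.embeddings.position_embeddings.weight
print('NaN in position embeddings:', torch.isnan(pos_emb).any().item())
print('max |position embedding|:', pos_emb.abs().max().item())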

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 7

Top GitHub Comments

4 reactions
shanybarhom commented, Oct 21, 2020

@pranav-ust @ibeltagy I found a solution that works for me. Replace:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

with:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

The idea behind this is that, for some reason, the first line works fine for RoBERTa (it creates a tensor of zeros), but for XLM-R it creates a tensor containing "trashy" values (very large numbers). Since the first two position embeddings are never initialized, this causes NaN values in the padding vector.
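
In context, the change sits inside create_long_model where the larger matrix is allocated. The extra line that copies the two reserved positions is an optional addition, not part of the comment above:

# allocate the larger position-embedding matrix with zeros instead of
# uninitialized memory, so positions 0 and 1 never hold garbage values
new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

# optional (not in the comment above): also copy the two reserved RoBERTa
# positions from the pretrained weights instead of leaving them at zero
new_pos_embed[:2] = model.roberta.embeddings.position_embeddings.weight[:2]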

0 reactions
MarkusSagen commented, Nov 24, 2020


Nice! I solved this by lowering the gradient clipping threshold from 5.0 to 1.0. I was also getting NaN values when training a full XLM-R model otherwise.
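
For reference, when training with the Hugging Face Trainer the clipping threshold is the max_grad_norm argument of TrainingArguments; a hand-written PyTorch loop would call clip_grad_norm_ instead. The surrounding training setup is assumed here, since the comment does not show it:

from transformers import TrainingArguments

# clip gradients at 1.0 instead of 5.0; output_dir is a hypothetical path
training_args = TrainingArguments(
    output_dir='./xlm-roberta-long-cls',
    max_grad_norm=1.0,
)

# equivalent step in a manual PyTorch training loop:
# torch.nn.utils.clip_grad_norm_(model_roberta.parameters(), max_norm=1.0)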
