Finetuning XLM Longformer gets a lot of NaN values
I am trying to follow your code for building a custom Longformer out of XLM models (specifically XLM-RoBERTa), but I get NaN values as soon as I start training the model on a downstream classification task, so I am confused about where I am going wrong.
I am using PyTorch 1.6 and Transformers 3.1.
Here is the basic structure of my code; note that the only change from the original conversion script is swapping RoBERTa for XLM-RoBERTa:
import logging

from transformers import XLMRobertaForSequenceClassification, XLMRobertaTokenizer
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)


class RobertaLongSelfAttention(LongformerSelfAttention):
    '''
    Wrapper around Longformer's self attention
    '''
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(
            hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )


class RobertaLongForSequenceClassification(XLMRobertaForSequenceClassification):
    '''
    Change all layers to have longer attention
    '''
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)
def create_long_model(save_model_to, attention_window, max_pos):
    model = XLMRobertaForSequenceClassification.from_pretrained(
        'xlm-roberta-base', gradient_checkpointing=True)
    tokenizer = XLMRobertaTokenizer.from_pretrained(
        'xlm-roberta-base', model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0 and 1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos

    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    while k < max_pos - 1:
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value
        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value
        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer
And here is how I load the model once it has been saved to a path. I am currently using a maximum position length of 1024:
logger.info(f'Converting xlm-roberta-base into xlm-roberta-base-{model_args.max_pos}')
model, tokenizer = create_long_model(
    save_model_to=model_path,
    attention_window=model_args.attention_window,
    max_pos=model_args.max_pos,
)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
model_roberta = RobertaLongForSequenceClassification.from_pretrained(model_path)
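A quick sanity check right after conversion can narrow down where the NaNs come from before any training step. The snippet below is a minimal diagnostic sketch, not part of the original report; it assumes the model_roberta and tokenizer objects created above and pads the test input to the 1024-token maximum so the sequence length is a multiple of the attention window:

import torch

# Inspect the resized position embeddings: uninitialized rows show up as NaNs
# or absurdly large values.
pos_embed = model_roberta.roberta.embeddings.position_embeddings.weight
print('NaN in position embeddings:', torch.isnan(pos_embed).any().item())
print('max abs position embedding:', pos_embed.abs().max().item())

# Single forward pass; with transformers 3.1 the first element of the output
# tuple is the classification logits.
inputs = tokenizer('A short test sentence.', padding='max_length',
                   truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model_roberta(**inputs)
print('NaN in logits:', torch.isnan(outputs[0]).any().item())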
Top GitHub Comments
@pranav-ust @ibeltagy I found a solution that works for me. Replace:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

with:

new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_zeros(max_pos, embed_size)

The idea behind this is that, for some reason, the first line happens to work for RoBERTa (the tensor comes back filled with zeros), but for XLM-R the uninitialized tensor contains “trashy” values (very large numbers). Since the first two position embeddings are never overwritten by the copy loop, that garbage ends up in the padding vector and produces NaN values.
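To make the change concrete, here is a minimal sketch of the fixed allocation inside create_long_model. The explicit copy of the first two rows is an extra precaution suggested here, not part of the comment above: it carries over the two reserved (padding) position rows from the pretrained model instead of leaving them at zero.

old_pos_embed = model.roberta.embeddings.position_embeddings.weight

# new_zeros guarantees initialized memory, unlike new_empty
new_pos_embed = old_pos_embed.new_zeros(max_pos, embed_size)
new_pos_embed[:2] = old_pos_embed[:2]  # keep the reserved/padding rows (extra precaution)

# same copy loop as in the original conversion code
k = 2
step = current_max_pos - 2
while k < max_pos - 1:
    new_pos_embed[k:(k + step)] = old_pos_embed[2:]
    k += step
model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed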
Nice! I solved this by lowering the gradient clipping threshold from 5.0 to 1.0; I also got NaNs when training a full XLM-R model otherwise.
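If training runs through the Hugging Face Trainer (an assumption; the issue does not show the training loop), the clipping threshold is the max_grad_norm training argument; in a hand-written loop the equivalent call is torch.nn.utils.clip_grad_norm_. A minimal sketch:

from transformers import TrainingArguments

# max_grad_norm controls gradient clipping in the Trainer; 1.0 is the value
# that worked in the comment above (and also the library default).
training_args = TrainingArguments(
    output_dir='output',   # hypothetical output path
    max_grad_norm=1.0,
)

# Equivalent step in a manual PyTorch training loop, applied after backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)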