
Segmentation fault when training on 5 (or more) GPUs

See original GitHub issue

When I test the model pretraining demo with 5 or more GPUs in parallel, I hit a segmentation fault, but it works properly with 4 or fewer GPUs.

Here is the demo code:

import torch
from longformer.longformer import Longformer, LongformerConfig, LongformerForMaskedLM2
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import utils
import numpy as np
from pytorch_optimization import get_optimization
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4'  # expose 5 GPUs
config = LongformerConfig.from_pretrained('./longformer-large-4096/')
config.attention_mode = 'tvm'

longformer = Longformer(config=config)
model = LongformerForMaskedLM2(config, longformer)
utils.torch_init_model(model, 'longformer-large-4096/pytorch_model.bin')
tokenizer = RobertaTokenizer(vocab_file='roberta_large/vocab.json',
                             merges_file='roberta_large/merges.txt')
tokenizer.model_max_length = config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world'] * 750)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
print(input_ids.shape)
model.half()
# TVM code doesn't work on CPU.
# Uncomment this if `config.attention_mode = 'tvm'`
model = model.cuda()
optimizer = get_optimization(model=model,
                             float16=True,
                             learning_rate=3e-5,
                             total_steps=10000,
                             schedule='warmup_linear',
                             warmup_rate=0.1,
                             max_grad_norm=1.0,
                             weight_decay_rate=0.01)
model = torch.nn.DataParallel(model)  # replicate the model across all visible GPUs; inputs are split along the batch dimension
input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
# attention_mask[:, [1, 1023, ]] = 2  # Set global attention based on the task. For example,
# classification: the <s> token
# QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(input_ids, attention_mask, config.attention_window[0],
                                               tokenizer.pad_token_id)
print(input_ids.shape, attention_mask.shape)
# randomly pick 300 token positions to mask and assign them random labels for the MLM demo
masked_positions = np.random.choice(np.arange(0, input_ids.shape[1]), 300, replace=False)
masked_positions = torch.tensor(masked_positions).unsqueeze(0).cuda()
masked_lm_labels = torch.tensor(np.random.randint(0, 50000, masked_positions.shape)).cuda()

for i in range(10000):
    # repeat the single example 5 times so DataParallel can hand one sample to each GPU
    loss = model(input_ids=input_ids.repeat(5, 1),
                 attention_mask=attention_mask.repeat(5, 1),
                 masked_positions=masked_positions.repeat(5, 1),
                 masked_lm_labels=masked_lm_labels.repeat(5, 1))
    if loss.shape[0] > 1:
        loss = loss.mean()  # DataParallel returns one loss per GPU; average them
    loss_value = loss.item()
    print('Step:{}/10000, Loss:{}'.format(i, loss_value))
    optimizer.backward(loss)
    optimizer.step()
    model.zero_grad()

Here is the error (screenshot in the original issue; it shows the segmentation fault during training).

With 4 GPUs it works successfully (screenshot in the original issue).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6

Top GitHub Comments

2 reactions
ewrfcas commented, Jun 12, 2020

I have successfully trained a longformer-base model in Chinese, and it works well. Thanks again for your advice and code, @ibeltagy!

1 reaction
ibeltagy commented, Jun 3, 2020

"OMG, it works!"

lol, I don't know exactly what the problem is, but it seems to be related to tvm.load; loading the binaries when the job starts, before any functions are called, seems to address it.

"If I use tvm while mixing fp16 and fp32, should this line always stay uncommented?"

I will keep it uncommented for all tvm experiments, fp16 or fp32.
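
Below is a minimal sketch of that workaround: load the compiled TVM kernels once at job startup, before the model is wrapped in DataParallel and before any forward pass runs. The loader call used here (DiagonaledMM._get_function from longformer.diagonaled_mm_tvm) is an assumption about the package layout and is not confirmed in this thread; adjust it to whatever actually loads the TVM binaries in your checkout.

# Assumption: longformer.diagonaled_mm_tvm.DiagonaledMM is where the compiled
# TVM kernel gets loaded; adjust the import if the package layout differs.
from longformer.diagonaled_mm_tvm import DiagonaledMM

def preload_tvm_kernels():
    # Load the fp16 and fp32 binaries up front so later calls (including calls
    # made from DataParallel replicas on other GPUs) reuse the already-loaded
    # functions instead of each triggering tvm.load on its own.
    for dtype in ('float16', 'float32'):
        DiagonaledMM._get_function(dtype, 'cuda')

preload_tvm_kernels()  # call this at the very start of the job, before building the model
# ... then build the Longformer model, wrap it in torch.nn.DataParallel,
# and run the training loop from the demo above ...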


