
Segmentation fault when training on 5 (or more) GPUs

See original GitHub issue

When I test the model pretraining demo with 5 or more GPUs in parallel, I hit a segmentation fault, but it works properly with 4 or fewer GPUs.

Here is the demo code:

import torch
from longformer.longformer import Longformer, LongformerConfig, LongformerForMaskedLM2
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import utils
import numpy as np
from pytorch_optimization import get_optimization
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4'  # expose 5 GPUs
config = LongformerConfig.from_pretrained('./longformer-large-4096/')
config.attention_mode = 'tvm'

longformer = Longformer(config=config)
model = LongformerForMaskedLM2(config, longformer)
utils.torch_init_model(model, 'longformer-large-4096/pytorch_model.bin')
tokenizer = RobertaTokenizer(vocab_file='roberta_large/vocab.json',
                             merges_file='roberta_large/merges.txt')
tokenizer.model_max_length = config.max_position_embeddings

SAMPLE_TEXT = ' '.join(['Hello world'] * 750)  # long input document

input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
print(input_ids.shape)
model.half()
# TVM code doesn't work on CPU.
# Uncomment this if `config.attention_mode = 'tvm'`
model = model.cuda()
optimizer = get_optimization(model=model,
                             float16=True,
                             learning_rate=3e-5,
                             total_steps=10000,
                             schedule='warmup_linear',
                             warmup_rate=0.1,
                             max_grad_norm=1.0,
                             weight_decay_rate=0.01)
model = torch.nn.DataParallel(model)  # replicate the model across all visible GPUs; inputs are split along the batch dimension
input_ids = input_ids.cuda()

# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
# attention_mask[:, [1, 1023, ]] = 2  # Set global attention based on the task. For example,
# classification: the <s> token
# QA: question tokens

# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(input_ids, attention_mask, config.attention_window[0],
                                               tokenizer.pad_token_id)
print(input_ids.shape, attention_mask.shape)
# randomly pick 300 token positions to mask and assign them random labels for the MLM demo
masked_positions = np.random.choice(np.arange(0, input_ids.shape[1]), 300, replace=False)
masked_positions = torch.tensor(masked_positions).unsqueeze(0).cuda()
masked_lm_labels = torch.tensor(np.random.randint(0, 50000, masked_positions.shape)).cuda()

for i in range(10000):
    # repeat the single example 5 times so DataParallel can hand one sample to each GPU
    loss = model(input_ids=input_ids.repeat(5, 1),
                 attention_mask=attention_mask.repeat(5, 1),
                 masked_positions=masked_positions.repeat(5, 1),
                 masked_lm_labels=masked_lm_labels.repeat(5, 1))
    if loss.shape[0] > 1:
        loss = loss.mean()  # DataParallel returns one loss per GPU; average them
    loss_value = loss.item()
    print('Step:{}/10000, Loss:{}'.format(i, loss_value))
    optimizer.backward(loss)
    optimizer.step()
    model.zero_grad()

Here is the error (screenshot in the original issue; it shows the segmentation fault during training).

With 4 GPUs it works successfully (screenshot in the original issue).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6

Top GitHub Comments

2 reactions
ewrfcas commented, Jun 12, 2020

I have successfully trained a longformer-base model in Chinese, and it works well. Thanks again for your advice and code, @ibeltagy!

1 reaction
ibeltagy commented, Jun 3, 2020

"OMG, it works!"

lol, I don't know exactly what the problem is, but it seems to be related to tvm.load; loading the binaries when the job starts, before any functions are called, seems to address it.

"If I use tvm while mixing fp16 and fp32, should this line always stay uncommented?"

I will keep it uncommented for all tvm experiments, fp16 or fp32.
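
Below is a minimal sketch of that workaround: load the compiled TVM kernels once at job startup, before the model is wrapped in DataParallel and before any forward pass runs. The loader call used here (DiagonaledMM._get_function from longformer.diagonaled_mm_tvm) is an assumption about the package layout and is not confirmed in this thread; adjust it to whatever actually loads the TVM binaries in your checkout.

# Assumption: longformer.diagonaled_mm_tvm.DiagonaledMM is where the compiled
# TVM kernel gets loaded; adjust the import if the package layout differs.
from longformer.diagonaled_mm_tvm import DiagonaledMM

def preload_tvm_kernels():
    # Load the fp16 and fp32 binaries up front so later calls (including calls
    # made from DataParallel replicas on other GPUs) reuse the already-loaded
    # functions instead of each triggering tvm.load on its own.
    for dtype in ('float16', 'float32'):
        DiagonaledMM._get_function(dtype, 'cuda')

preload_tvm_kernels()  # call this at the very start of the job, before building the model
# ... then build the Longformer model, wrap it in torch.nn.DataParallel,
# and run the training loop from the demo above ...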


