Segmentation fault when training with 5 (or more) GPUs
See original GitHub issue
When I test the model pretraining demo with 5 or more GPUs in parallel, I hit a segmentation fault, but it works properly with 4 or fewer GPUs.
Here is the demo code:
import torch
from longformer.longformer import Longformer, LongformerConfig, LongformerForMaskedLM2
from longformer.sliding_chunks import pad_to_window_size
from transformers import RobertaTokenizer
import utils
import numpy as np
from pytorch_optimization import get_optimization
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4'
config = LongformerConfig.from_pretrained('./longformer-large-4096/')
config.attention_mode = 'tvm'
longformer = Longformer(config=config)
model = LongformerForMaskedLM2(config, longformer)
utils.torch_init_model(model, 'longformer-large-4096/pytorch_model.bin')
tokenizer = RobertaTokenizer(vocab_file='roberta_large/vocab.json',
                             merges_file='roberta_large/merges.txt')
tokenizer.model_max_length = config.max_position_embeddings
SAMPLE_TEXT = ' '.join(['Hello world'] * 750) # long input document
input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0) # batch of size 1
print(input_ids.shape)
model.half()
# TVM code doesn't work on CPU.
# Uncomment this if `config.attention_mode = 'tvm'`
model = model.cuda()
optimizer = get_optimization(model=model,
                             float16=True,
                             learning_rate=3e-5,
                             total_steps=10000,
                             schedule='warmup_linear',
                             warmup_rate=0.1,
                             max_grad_norm=1.0,
                             weight_decay_rate=0.01)
model = torch.nn.DataParallel(model)
input_ids = input_ids.cuda()
# Attention mask values -- 0: no attention, 1: local attention, 2: global attention
attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
# attention_mask[:, [1, 1023, ]] = 2  # Set global attention based on the task. For example,
#                                       classification: the <s> token
#                                       QA: question tokens
# padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
input_ids, attention_mask = pad_to_window_size(input_ids, attention_mask, config.attention_window[0],
                                               tokenizer.pad_token_id)
print(input_ids.shape, attention_mask.shape)
masked_positions = np.random.choice(np.arange(0, input_ids.shape[1]), 300, replace=False)
masked_positions = torch.tensor(masked_positions).unsqueeze(0).cuda()
masked_lm_labels = torch.tensor(np.random.randint(0, 50000, masked_positions.shape)).cuda()
for i in range(10000):
    loss = model(input_ids=input_ids.repeat(5, 1),
                 attention_mask=attention_mask.repeat(5, 1),
                 masked_positions=masked_positions.repeat(5, 1),
                 masked_lm_labels=masked_lm_labels.repeat(5, 1))
    if loss.shape[0] > 1:
        loss = loss.mean()
    loss_value = loss.item()
    print('Step:{}/10000, Loss:{}'.format(i, loss_value))
    optimizer.backward(loss)
    optimizer.step()
    model.zero_grad()
Here is the error:
It works successfully with 4 GPUs:
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I have trained a longformer-base model in Chinese successfully, and it works well. Thanks again for your advice and code, @ibeltagy!
lol, I don't know exactly what the problem is, but it seems to be related to tvm.load, and loading the binaries when starting the job, before calling any functions, seems to address it. I will keep it uncommented for all TVM experiments, fp16 or fp32.
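For anyone else hitting this, one way to apply the workaround above is to force the TVM binaries to load once in the main process, before torch.nn.DataParallel replicates the model across the 5 GPUs. The snippet below is only a minimal sketch of that idea; the warm_up_tvm_kernels helper and the dummy forward pass are my own assumptions, not the exact line kept commented out in the repo:

import torch

def warm_up_tvm_kernels(encoder: torch.nn.Module, seq_len: int = 512) -> None:
    # Run one throw-away forward pass on a single GPU so the lazily loaded
    # TVM binaries are pulled in by the main process before any replication.
    # `encoder` is meant to be the bare Longformer encoder (the `longformer`
    # object in the demo above, after model.cuda()); the helper name and the
    # dummy inputs are assumptions, not part of the longformer repo.
    device = next(encoder.parameters()).device
    dummy_ids = torch.zeros((1, seq_len), dtype=torch.long, device=device)
    dummy_mask = torch.ones((1, seq_len), dtype=torch.long, device=device)  # local attention everywhere
    with torch.no_grad():
        encoder(input_ids=dummy_ids, attention_mask=dummy_mask)

# In the demo this would go right after `model = model.cuda()` and before
# `model = torch.nn.DataParallel(model)`:
# warm_up_tvm_kernels(longformer)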