Segmentation fault

The model fails during the forward pass in the train step. The only error reported is “Segmentation fault”:

dataset          cityscapes_train
batch_size       1
data_dir         ./dataset/cityscapes
data_list        ./dataset/list/cityscapes/train.lst
ignore_label     255
input_size       769,769
is_training      False
learning_rate    0.01
momentum         0.9
not_restore_last False
num_classes      19
start_iters      0
num_steps        40000
power            0.9
random_mirror    True
random_scale     True
random_seed      304
restore_from     ./pretrained_model/resnet101-imagenet.pth
save_num_images  2
save_pred_every  5000
snapshot_dir     checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/
weight_decay     0.0005
gpu              0,3,4
ohem_thres       0.7
ohem_thres1      0.8
ohem_thres2      0.5
use_weight       True
use_val          False
use_extra        False
ohem             False
ohem_keep        0
network          resnet101
method           asp_oc_dsn
reduce           True
ohem_single      False
use_parallel     False
dsn_weight       0.4
pair_weight      1
seed             304
output_path      ./seg_output_eval_set
store_output     False
use_flip         False
use_ms           False
predict_choice   whole
whole_scale      1
start_epochs     0
end_epochs       120
save_epoch       20
criterion        ce
eval             False
fix_lr           False
log_file         
use_normalize_transform False
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:69: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.weight, 0)
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:70: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.bias, 0)
/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 3 which
    has less than 75% of the memory or cores of GPU 1. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
w/ class balance
41650 images are loaded!
learning_rate: 0.01
torch.Size([1, 3, 769, 769])
Segmentation fault

I added a bunch of print statements and saw that the error happens at this step:

preds = model(images)

I checked the GPU usage: there was over 11GB of GPU memory free when the error occurred, so it’s not a memory issue. Also, when I first ran the .sh file, it reported errors because the directories for log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?
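A bare “Segmentation fault” carries no Python traceback, which is why print statements were needed to narrow it down. The sketch below shows two small helpers that could be called at the top of a training script: one creates the log directories up front, the other turns on the standard-library `faulthandler` so that a crash in native code prints the Python line that was executing. The helper names and the directory names in the usage comment are illustrative assumptions, not part of the repository.

```python
import faulthandler
import os
import sys


def ensure_log_dirs(paths):
    """Create each directory (and its parents) if it does not exist yet."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
    # Report whether each path is now a directory.
    return [os.path.isdir(path) for path in paths]


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python traceback of all threads on fatal signals such as SIGSEGV."""
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage at the top of a training script (paths are assumptions):
#   ensure_log_dirs(["log/log_train", "log/log_test"])
#   enable_crash_traceback()
#   ...
#   preds = model(images)  # a crash here would now print the Python stack
```

The same effect is available without code changes by launching the script with `python -X faulthandler`. Note that this only identifies the Python call that triggered the crash; finding the failing native frame would typically need a debugger such as gdb.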

Issue Analytics

Created 5 years ago
Comments: 25

Top GitHub Comments

lyxlynn commented, Oct 17, 2018

@Spandan-Madan Hi, I solved the problem by replacing ‘inplace_abn’ with the inplace-abn module from https://github.com/liutinglt/CE2P.

Best,
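For anyone trying the swap described in this comment, here is a minimal sketch of replacing one directory with another while keeping a backup of the original. The function name and the paths in the usage comment are hypothetical; where the inplace-abn module lives inside either checkout is an assumption to verify locally.

```python
import shutil
from pathlib import Path


def replace_module_dir(current: Path, replacement: Path) -> Path:
    """Swap `current` for a copy of `replacement`; return the backup path."""
    if not replacement.is_dir():
        raise FileNotFoundError(f"replacement directory not found: {replacement}")
    backup = current.with_name(current.name + ".bak")
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    if current.exists():
        # Keep the old module so the change can be undone.
        current.rename(backup)
    shutil.copytree(replacement, current)
    return backup


# Hypothetical usage (both paths are placeholders, not confirmed locations):
#   replace_module_dir(Path("OCNet/inplace_abn"), Path("path/to/CE2P/inplace_abn"))
```

If the module ships compiled extensions, they may need to be rebuilt after the swap, following that repository's own instructions.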

iDzh commented, Dec 10, 2019

@ackness I tried to run it following your method, but there are still some problems. Could you send me a working version? qq:2232661644. Many thanks!