Segmentation fault

The model fails during the forward pass in the train step. The only error reported is “Segmentation fault”:

dataset          cityscapes_train
batch_size       1
data_dir         ./dataset/cityscapes
data_list        ./dataset/list/cityscapes/train.lst
ignore_label     255
input_size       769,769
is_training      False
learning_rate    0.01
momentum         0.9
not_restore_last False
num_classes      19
start_iters      0
num_steps        40000
power            0.9
random_mirror    True
random_scale     True
random_seed      304
restore_from     ./pretrained_model/resnet101-imagenet.pth
save_num_images  2
save_pred_every  5000
snapshot_dir     checkpoint/snapshots_resnet101_asp_oc_dsn_1e-2_5e-4_8_40000/
weight_decay     0.0005
gpu              0,3,4
ohem_thres       0.7
ohem_thres1      0.8
ohem_thres2      0.5
use_weight       True
use_val          False
use_extra        False
ohem             False
ohem_keep        0
network          resnet101
method           asp_oc_dsn
reduce           True
ohem_single      False
use_parallel     False
dsn_weight       0.4
pair_weight      1
seed             304
output_path      ./seg_output_eval_set
store_output     False
use_flip         False
use_ms           False
predict_choice   whole
whole_scale      1
start_epochs     0
end_epochs       120
save_epoch       20
criterion        ce
eval             False
fix_lr           False
log_file         
use_normalize_transform False
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:69: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.weight, 0)
/data/graphics/toyota-pytorch/OCNet/network/../oc_module/base_oc_block.py:70: UserWarning: nn.init.constant is now deprecated in favor of nn.init.constant_.
  nn.init.constant(self.W.bias, 0)
/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py:24: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 3 which
    has less than 75% of the memory or cores of GPU 1. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
w/ class balance
41650 images are loaded!
learning_rate: 0.01
torch.Size([1, 3, 769, 769])
Segmentation fault

I added a bunch of print statements and saw that the error happens at this step:

preds = model(images)

I checked the GPU usage: there was over 11GB of GPU memory free when the error occurred, so it’s not a memory issue. Also, when I first ran the .sh file, it reported errors because the directories log/log_train and log_test had not been created. I created them manually, and that error was resolved. But now the forward pass fails in the very first iteration. Any leads?
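
A bare “Segmentation fault” gives no Python traceback, so one way to narrow down where `preds = model(images)` crashes is the standard-library `faulthandler` module, which dumps the Python stack of every thread when the process receives SIGSEGV. The sketch below is only an illustration: the helper name `enable_crash_traceback` and the log file name are assumptions, not part of the OCNet code.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Dump Python tracebacks to log_path if the process crashes.

    Hypothetical helper: the name and default path are illustrative only.
    The returned file object must stay open for as long as the handler
    is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps DataLoader / worker threads.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    # ... build the model and data loader as usual, then run the step
    # that crashes in the issue:
    # preds = model(images)
```

Calling this once at the top of the training script should leave the last Python frames in the log file after the crash, which usually shows whether the fault is inside a compiled extension. The same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler`.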

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 25

Top GitHub Comments

1 reaction
lyxlynn commented, Oct 17, 2018

@Spandan-Madan Hi, I solved the problem by replacing ‘inplace_abn’ with the inplace-abn module from https://github.com/liutinglt/CE2P.

Best,

0 reactions
iDzh commented, Dec 10, 2019

@ackness, I tried to run it according to your method, but there are still some problems. Could you send me a working version? qq:2232661644. Many thanks!
