Segmentation fault (core dumped) during training with gcc 4.9, CUDA 10.0, PyTorch 1.0.0.dev20190306
🐛 Bug
Hi, I encountered a segmentation fault (core dumped) right after training started:
2019-03-07 12:44:02,697 maskrcnn_benchmark.trainer INFO: Start training
Segmentation fault (core dumped)
Environment
PyTorch version: 1.0.0.dev20190306
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.9.0
CMake version: version 3.13.3
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
GPU 2: TITAN V
GPU 3: TITAN V
GPU 4: TITAN V
GPU 5: TITAN V
GPU 6: TITAN V
GPU 7: TITAN V
Nvidia driver version: 410.78
cuDNN version: /usr/local/cuda-10.0/lib64/libcudnn.so.7
Versions of relevant libraries:
[pip] numpy==1.16.2
[pip] torch==1.0.0.dev20190306
[pip] torchvision==0.2.3
[conda] blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2019.1 144 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_fft 1.0.10 py37ha843d7b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl_random 1.0.2 py37hd81dba3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] pytorch-nightly 1.0.0.dev20190306 py3.7_cuda10.0.130_cudnn7.4.2_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
Pillow (5.4.1)
I've double-checked the versions of gcc, PyTorch, and CUDA. I've also tried gcc 5.2 and the PyTorch nightly build but got the same error, and I rebuilt the project (rm -r build/) every time a setting changed. I've installed and run maskrcnn-benchmark successfully on another Linux machine following the same installation instructions. Help me please!
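For completeness, this is roughly how I check which toolchain versions the build actually picks up (just a sketch; the exact output differs per machine):

gcc --version                          # host compiler on PATH
nvcc --version                         # CUDA toolkit compiler
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # PyTorch build and the CUDA version it was built with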
Following your earlier advice, I hope the following info can help…
The command I used:
python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.WEIGHT maskrcnn_benchmark/pretrained_model/e2e_mask_rcnn_R_50_FPN_1x.pth
The output:
Starting program: /home/gongke/anaconda3/envs/py36/bin/python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1 MODEL.WEIGHT maskrcnn_benchmark/pretrained_model/e2e_mask_rcnn_R_50_FPN_1x.pth
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /home/gongke/anaconda3/envs/py36/lib/python3.6/site-packages/numpy/../../../libiomp5.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/50/126f8244a53ed88a1531546fcaa8dedc4bc85c.debug
Detaching after fork from child process 29904.
Detaching after fork from child process 29936.
Missing separate debuginfo for /home/gongke/anaconda3/envs/py36/lib/python3.6/site-packages/cv2/.libs/libz-a147dcb0.so.1.2.3
2019-03-07 13:16:35,697 maskrcnn_benchmark INFO: Using 1 GPUs
2019-03-07 13:16:35,697 maskrcnn_benchmark INFO: Namespace(config_file='configs/e2e_mask_rcnn_R_50_FPN_1x.yaml', distributed=False, local_rank=0, opts=['SOLVER.IMS_PER_BATCH', '2', 'SOLVER.BASE_LR', '0.0025', 'SOLVER.MAX_ITER', '720000', 'SOLVER.STEPS', '(480000, 640000)', 'TEST.IMS_PER_BATCH', '1', 'MODEL.WEIGHT', 'maskrcnn_benchmark/pretrained_model/e2e_mask_rcnn_R_50_FPN_1x.pth'], skip_test=False)
2019-03-07 13:16:35,697 maskrcnn_benchmark INFO: Collecting env info (might take some time)
Detaching after fork from child process 29940.
Detaching after fork from child process 29961.
[New Thread 0x7fff9276a700 (LWP 30029)]
Detaching after fork from child process 30030.
Detaching after fork from child process 30031.
Detaching after fork from child process 30049.
Detaching after fork from child process 30099.
Detaching after fork from child process 30105.
Detaching after fork from child process 30139.
Detaching after fork from child process 30144.
Detaching after fork from child process 30149.
Detaching after fork from child process 30154.
2019-03-07 13:16:50,207 maskrcnn_benchmark INFO:
PyTorch version: 1.0.0.dev20190306
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.9.0
CMake version: version 3.13.3
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
GPU 2: TITAN V
GPU 3: TITAN V
GPU 4: TITAN V
GPU 5: TITAN V
GPU 6: TITAN V
GPU 7: TITAN V
Nvidia driver version: 410.78
cuDNN version: /usr/local/cuda-10.0/lib64/libcudnn.so.7
Versions of relevant libraries:
[pip] deepvoice3-pytorch==0.1.1+cbf81cb
[pip] numpy==1.15.4
[pip] numpydoc==0.8.0
[pip] torch==1.0.0.dev20190306
[pip] torchvision==0.2.3
[conda] blas 1.0 mkl defaults
[conda] cuda100 1.0 0 pytorch
[conda] deepvoice3-pytorch 0.1.1+cbf81cb dev_0 <develop>
[conda] mkl 2019.1 144 defaults
[conda] mkl-service 1.1.2 py36he904b0f_5 defaults
[conda] mkl_fft 1.0.6 py36hd81dba3_0 defaults
[conda] mkl_random 1.0.2 py36hd81dba3_0 defaults
[conda] pytorch-nightly 1.0.0.dev20190306 py3.6_cuda10.0.130_cudnn7.4.2_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
Pillow (5.3.0)
2019-03-07 13:16:50,209 maskrcnn_benchmark INFO: Loaded configuration file configs/e2e_mask_rcnn_R_50_FPN_1x.yaml
2019-03-07 13:16:50,210 maskrcnn_benchmark INFO:****
[New Thread 0x7fff77ca4700 (LWP 30203)]
[New Thread 0x7fff774a3700 (LWP 30313)]
2019-03-07 13:17:05,810 maskrcnn_benchmark.utils.checkpoint INFO: Loading checkpoint from maskrcnn_benchmark/pretrained_model/e2e_mask_rcnn_R_50_FPN_1x.pth
[New Thread 0x7fff91ee8780 (LWP 30730)]
[New Thread 0x7fff91ae6800 (LWP 30731)]
[New Thread 0x7fff916e4880 (LWP 30732)]
[New Thread 0x7fff912e2900 (LWP 30734)]
[New Thread 0x7fff90ee0980 (LWP 30735)]
[New Thread 0x7fff907d8a00 (LWP 30736)]
[New Thread 0x7fff76bb3a80 (LWP 30738)]
[New Thread 0x7fff767b1b00 (LWP 30740)]
[New Thread 0x7fff763afb80 (LWP 30743)]
[New Thread 0x7fff75fadc00 (LWP 30744)]
[New Thread 0x7fff75babc80 (LWP 30746)]
[New Thread 0x7fff757a9d00 (LWP 30747)]
[New Thread 0x7fff753a7d80 (LWP 30748)]
[New Thread 0x7fff74fa5e00 (LWP 30750)]
[New Thread 0x7fff74ba3e80 (LWP 30751)]
[New Thread 0x7fff747a1f00 (LWP 30752)]
[New Thread 0x7fff6dffef80 (LWP 30753)]
[New Thread 0x7fff6dbfd000 (LWP 30754)]
[New Thread 0x7fff6d7fb080 (LWP 30755)]
[New Thread 0x7fff6d3f9100 (LWP 30756)]
[New Thread 0x7fff6cff7180 (LWP 30757)]
[New Thread 0x7fff6cbf5200 (LWP 30758)]
[New Thread 0x7fff6c7f3280 (LWP 30759)]
2019-03-07 13:17:06,183 maskrcnn_benchmark.utils.model_serialization INFO: backbone.body.layer1.0.bn1.bias loaded from backbone.body.layer1.0.bn1.bias of shape (64,)
......... (remaining model-loading log lines omitted)
2019-03-07 13:17:06,654 maskrcnn_benchmark.data.build WARNING: When using more than one image per GPU you may encounter an out-of-memory (OOM) error if your GPU does not have sufficient memory. If this happens, you can reduce SOLVER.IMS_PER_BATCH (for training) or TEST.IMS_PER_BATCH (for inference). For training, you must also adjust the learning rate and schedule length according to the linear scaling rule. See for example: https://github.com/facebookresearch/Detectron/blob/master/configs/getting_started/tutorial_1gpu_e2e_faster_rcnn_R-50-FPN.yaml#L14
loading annotations into memory...
Done (t=13.27s)
creating index...
index created!
2019-03-07 13:17:22,932 maskrcnn_benchmark.trainer INFO: Start training
Detaching after fork from child process 30978.
Detaching after fork from child process 30981.
Detaching after fork from child process 30982.
Detaching after fork from child process 30983.
[New Thread 0x7ffe9213e700 (LWP 31011)]
[New Thread 0x7ffe90cbe700 (LWP 31012)]
[New Thread 0x7ffe8bfff700 (LWP 31013)]
[New Thread 0x7ffe8b7fe700 (LWP 31014)]
Program received signal SIGSEGV, Segmentation fault.
0x00007fffa0ab7cc2 in construct<_object*, _object*> (__p=0xb, this=0x55555687dd78) at /home/gongke/GCC-4.9.0/include/c++/4.9.0/ext/new_allocator.h:120
120 { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
Missing separate debuginfos, use: debuginfo-install libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.5-2.el7.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64
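For context, the trace above was captured by running the training script under gdb, roughly like this (same command as above, shortened here; the remaining SOLVER/TEST/MODEL options are identical):

gdb --args python tools/train_net.py --config-file "configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 2 ...
(gdb) run    # reproduce the crash
(gdb) bt     # print the backtrace after the SIGSEGV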
Looking forward to your reply!
Top GitHub Comments
Finally I've solved this problem by switching my gcc version to 5.4.0. I had tried gcc 4.9.0 and gcc 5.2.0 and hit the same error with both. For those who are struggling with this problem, I suggest trying another gcc version.
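Roughly what I did to switch compilers and rebuild (the gcc path below is just an example of where 5.4.0 might be installed; adjust it for your machine):

export CC=/usr/local/gcc-5.4.0/bin/gcc     # point the build at gcc 5.4.0
export CXX=/usr/local/gcc-5.4.0/bin/g++
cd maskrcnn-benchmark
rm -rf build/                              # drop objects built with the old compiler
python setup.py build develop              # recompile the CUDA/C++ extensions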
@Jacobew Hello, could we discuss this in more depth? There are several questions I would like to ask you. My QQ is 1581592445, thank you!