
Keep getting error: segmentation fault

See original GitHub issue

I was trying to train a Mask R-CNN model. However, I kept running into “Segmentation fault” and the whole process would just shut down spontaneously. I’ve tried compiling PyTorch from source and upgrading my gcc to 7.3.0, but neither of them seemed to help. The environment I’m currently working on is CentOS, gcc 7.3.0, PyTorch 1.3.0 and torchvision 0.4.1. The traceback is as follows:

2019-11-07 21:54:34,120 - INFO - workflow: [('train', 1)], max: 12 epochs
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Traceback (most recent call last):
  File "./tools/train.py", line 108, in <module>
    main()
  File "./tools/train.py", line 104, in main
    logger=logger)
  File "/home/teddybear12/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/teddybear12/mmdetection/mmdet/apis/train.py", line 227, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv-0.2.14-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 358, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/mmcv-0.2.14-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 264, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/teddybear12/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
    losses = model(**data)
  File "/home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/teddybear12/mmdetection/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/home/teddybear12/mmdetection/mmdet/models/detectors/base.py", line 100, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/teddybear12/mmdetection/mmdet/models/detectors/two_stage.py", line 182, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/home/teddybear12/mmdetection/mmdet/core/fp16/decorators.py", line 127, in new_func
    return old_func(*args, **kwargs)
  File "/home/teddybear12/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 272, in get_bboxes
    scale_factor, cfg, rescale)
  File "/home/teddybear12/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 92, in get_bboxes_single
    proposals, _ = nms(proposals, cfg.nms_thr)
 File "/home/teddybear12/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 43, in nms
    inds = nms_cuda.nms(dets_th, iou_thr)
RuntimeError: CUDA error: invalid device function (launch_kernel at /pytorch/aten/src/ATen/native/cuda/Loops.cuh:102)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2ac5f02de813 in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> const&) + 0x7bb (0x2ac5aa7636bb in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: <unknown function> + 0x5481902 (0x2ac5aa75d902 in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x5481dc8 (0x2ac5aa75ddc8 in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x1aa972b (0x2ac5a6d8572b in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: at::native::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x44e (0x2ac5a6d8055e in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x1fa087a (0x2ac5a727c87a in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: <unknown function> + 0x3a72fdd (0x2ac5a8d4efdd in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: at::Tensor::index(c10::ArrayRef<at::Tensor>) const + 0xbb (0x2ac5a8a15d6b in /home/teddybear12/.conda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: nms_cuda(at::Tensor, float) + 0x730 (0x2ac60bd22a44 in /home/teddybear12/mmdetection/mmdet/ops/nms/nms_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #10: nms(at::Tensor const&, float) + 0xdf (0x2ac60bd11faf in /home/teddybear12/mmdetection/mmdet/ops/nms/nms_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x3bc9d (0x2ac60bd20c9d in /home/teddybear12/mmdetection/mmdet/ops/nms/nms_cuda.cpython-37m-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x393ba (0x2ac60bd1e3ba in /home/teddybear12/mmdetection/mmdet/ops/nms/nms_cuda.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>

/var/spool/slurm/d/job84751/slurm_script: line 12: 28620 Segmentation fault

What should I do? I could really use some help; this situation is very frustrating.
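
For anyone hitting the same trace: the RuntimeError at the bottom ("CUDA error: invalid device function") usually means the compiled CUDA extensions do not match the GPU architecture or CUDA toolkit that is actually in use at runtime. A minimal sanity check of the runtime environment, assuming a standard PyTorch install (device index 0 is just an example), looks like this:

# Sketch: print the PyTorch build, CUDA toolkit and GPU compute capability
# that the training job actually sees.
import torch

print("torch:", torch.__version__)           # the reporter states 1.3.0
print("CUDA toolkit:", torch.version.cuda)   # should match the toolkit used to build mmdetection's ops
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))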

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
ljc199504 commented, Nov 13, 2019

Downgrading PyTorch to 1.1 and recompiling with setup.py to regenerate the build might solve this problem.
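
If you try that downgrade, a quick check (just a sketch) that the active environment really resolved the downgraded build before recompiling the ops:

# Sketch: confirm the interpreter picked up the downgraded PyTorch.
import torch, torchvision

assert torch.__version__.startswith("1.1"), f"unexpected torch {torch.__version__}"
print(torch.__version__, torchvision.__version__, torch.version.cuda)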

0 reactions
hellock commented, Dec 14, 2019

@teddybear0212 This is mostly caused by the environment. You may update to the latest version, remove the original build directory, rebuild it, and then run python tools/collect_env.py to collect some environment information for troubleshooting.
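
As a rough illustration of that suggestion (paths are taken from the traceback above; the setup.py invocation assumes a standard setuptools extension build and may differ for your mmdetection version):

# Sketch, wrapped in Python for illustration: drop the stale build directory,
# recompile the CUDA/C++ ops in place, then dump environment info for the report.
import shutil, subprocess, sys
from pathlib import Path

repo = Path("/home/teddybear12/mmdetection")   # path from the traceback
build_dir = repo / "build"
if build_dir.exists():
    shutil.rmtree(build_dir)                   # remove objects built with an older toolchain
subprocess.check_call([sys.executable, "setup.py", "build_ext", "--inplace"], cwd=repo)
subprocess.check_call([sys.executable, "tools/collect_env.py"], cwd=repo)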

Read more comments on GitHub >

