
Bus error. Insufficient shared memory?

See original GitHub issue

Describe the bug
When workers_per_gpu > 0 in the config file, the error below is raised. I searched for some keywords from the error message, and the only workaround I found was to set workers_per_gpu to 0, which I understand makes data preparation and training run serially and is therefore time-consuming.
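
For context, as far as I understand, workers_per_gpu is forwarded to PyTorch's DataLoader as num_workers. A minimal sketch of that mechanism (toy dataset and shapes, not the actual mmdetection code path):

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset; the real one is built from the mmdetection config."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

# num_workers > 0 spawns worker processes that hand batches back to the main
# process through shared memory (/dev/shm); if that space is too small, a
# worker can die with SIGBUS ("DataLoader worker ... killed by signal: Bus error").
# num_workers = 0 loads data in the main process, which avoids shm but is serial.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)

for batch in loader:
    pass  # a training step would go here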

Reproduction

  1. What command or script did you run?
"""a piece of code, dcn_best.py is a config file"""
config_file = 'dcn_best.py'
output_path = "/cos_person/275/1745/object_detection/output/"

#train
os.system('python '+mmdetection_path+'tools/train.py '+main_path+'code/'+config_file+' --gpus 1 --work_dir '+output_path)
  2. Did you make any modifications to the code or config? Do you understand what you modified? I changed some parts to suit another dataset, and I commented out SyncBN (#norm_cfg=norm_cfg) because it is designed for distributed training and I only had access to one GPU (a P40). Please let me know if SyncBN can be used on a single GPU for a better result; see the sketch after this list for the two norm_cfg variants.
  3. What dataset did you use? A small subset of Open Images.
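
On the norm_cfg point above: SyncBN synchronizes batch statistics across GPUs and therefore only pays off with distributed training, while GN normalizes within each sample and works even with batch size 1. A hedged sketch of the two variants, following the usual mmdetection norm_cfg convention (the group count is an illustrative value, not taken from dcn_best.py):

# SyncBN: needs distributed training to synchronize statistics across GPUs;
# on a single GPU it behaves like plain BN.
norm_cfg_syncbn = dict(type='SyncBN', requires_grad=True)

# GN: statistics are computed per sample over channel groups, so they do not
# depend on the per-GPU batch size.
norm_cfg_gn = dict(type='GN', num_groups=32, requires_grad=True)

# Either dict would then be passed into the model config, e.g.
# backbone=dict(..., norm_cfg=norm_cfg_gn).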

Environment

  • OS: [linux-x86_64]
  • PyTorch version: [3.5]
  • How you installed PyTorch: [pip]
  • GPU model: [P40] (the environment is on a cloud platform and wasn't set up by me; I will try to find more accurate information if you need it)

Error traceback

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3958) is killed by signal: Bus error. 

Bug fix
The machine had 15 GB of memory and 2 CPUs. Insufficient shared memory? I'm not sure.
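
One way to confirm the shared-memory suspicion is to check how much space /dev/shm actually has (DataLoader workers pass batches through it, and Docker containers default it to 64 MB). A minimal sketch, assuming a Linux machine with /dev/shm mounted:

import shutil

# /dev/shm is the tmpfs that DataLoader workers use to hand tensors back to
# the main process; if it is very small or nearly full, workers can crash
# with a bus error.
total, used, free = shutil.disk_usage('/dev/shm')
print(f'/dev/shm total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB')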

And thanks for your great work. I can already see improvement after some training, and I will probably continue to use mmdetection for tasks with high accuracy requirements.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9

Top GitHub Comments

2 reactions
Pattorio commented, Jul 29, 2019

Never set workers_per_gpu=0, or you will spend a LOOOONG time training.

Did you use Docker? When you run docker, add --shm-size="8g", or any size you need as long as it is more than the 64 MB default.

0 reactions
FishLikeApple commented, Jul 29, 2019

Never set workers_per_gpu=0, or you will spend a LOOOONG time training.

Did you use Docker? When you run docker, add --shm-size="8g", or any size you need as long as it is more than the 64 MB default.

Thanks. I'm using a cloud platform which provides a virtual Linux machine. Now I'm using batch size 1 with GN (it's good to know that mmdetection supports GN). I think that, technically, GN doesn't depend on batch size.
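
To illustrate that last point, GroupNorm computes its statistics per sample over channel groups, so a given image produces the same output whether it is normalized alone or inside a larger batch. A quick sketch with plain PyTorch (toy shapes, unrelated to the actual detector):

import torch
import torch.nn as nn

# GroupNorm normalizes over groups of channels within each sample, unlike
# BatchNorm, which mixes statistics across the batch in training mode.
gn = nn.GroupNorm(num_groups=8, num_channels=64)

x = torch.randn(4, 64, 32, 32)            # a batch of 4 feature maps
single = gn(x[:1])                        # the first image, batch size 1
batched = gn(x)[:1]                       # the same image inside the batch
print(torch.allclose(single, batched, atol=1e-6))  # True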

Read more comments on GitHub

Top Results From Across the Web

Unexpected bus error encountered in worker. This might be ...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). #283.

Training crashes due to - Insufficient shared memory (shm)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). Here are my system's shared ...

increase pytorch shared memory | Data Science and ... - Kaggle
Error from kernel log ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). I also tried...

PyTorch Dataset leaking memory with basic I/O operation
RuntimeError: DataLoader worker (pid 10666) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory.

ryanfb/kraken - Docker Image
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). This is because Docker defaults to 64MB...
