
Bus error. Insufficient shared memory?

See original GitHub issue

Describe the bug
When workers_per_gpu > 0 in the config file, the error below is raised. I searched for some keywords from the error message, and the only workaround I found was to set workers_per_gpu to 0, which I understand makes data preparation and training run serially and is therefore time-consuming.
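
For context, as far as I understand, workers_per_gpu is forwarded to PyTorch's DataLoader as num_workers. A minimal sketch of that mechanism (toy dataset and shapes, not the actual mmdetection code path):

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset; the real one is built from the mmdetection config."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

# num_workers > 0 spawns worker processes that hand batches back to the main
# process through shared memory (/dev/shm); if that space is too small, a
# worker can die with SIGBUS ("DataLoader worker ... killed by signal: Bus error").
# num_workers = 0 loads data in the main process, which avoids shm but is serial.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)

for batch in loader:
    pass  # a training step would go here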

Reproduction

  1. What command or script did you run?
"""a piece of code, dcn_best.py is a config file"""
config_file = 'dcn_best.py'
output_path = "/cos_person/275/1745/object_detection/output/"

#train
os.system('python '+mmdetection_path+'tools/train.py '+main_path+'code/'+config_file+' --gpus 1 --work_dir '+output_path)
  2. Did you make any modifications to the code or config? Do you understand what you modified? I changed some parts to suit another dataset, and I commented out SyncBN (#norm_cfg=norm_cfg) because it is designed for distributed training and I only had access to one GPU (a P40). Please let me know if SyncBN can be used on a single GPU for a better result; see the sketch after this list for the two norm_cfg variants.
  3. What dataset did you use? A small subset of Open Images.
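
On the norm_cfg point above: SyncBN synchronizes batch statistics across GPUs and therefore only pays off with distributed training, while GN normalizes within each sample and works even with batch size 1. A hedged sketch of the two variants, following the usual mmdetection norm_cfg convention (the group count is an illustrative value, not taken from dcn_best.py):

# SyncBN: needs distributed training to synchronize statistics across GPUs;
# on a single GPU it behaves like plain BN.
norm_cfg_syncbn = dict(type='SyncBN', requires_grad=True)

# GN: statistics are computed per sample over channel groups, so they do not
# depend on the per-GPU batch size.
norm_cfg_gn = dict(type='GN', num_groups=32, requires_grad=True)

# Either dict would then be passed into the model config, e.g.
# backbone=dict(..., norm_cfg=norm_cfg_gn).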

Environment

  • OS: [linux-x86_64]
  • PyTorch version: [3.5]
  • How you installed PyTorch: [pip]
  • GPU model: [P40] (the environment is on a cloud platform and wasn't set up by me; I will try to find more accurate information if you need it)

Error traceback

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3958) is killed by signal: Bus error. 

Bug fix
The machine had 15 GB of memory and 2 CPUs. Insufficient shared memory? I'm not sure.
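
One way to confirm the shared-memory suspicion is to check how much space /dev/shm actually has (DataLoader workers pass batches through it, and Docker containers default it to 64 MB). A minimal sketch, assuming a Linux machine with /dev/shm mounted:

import shutil

# /dev/shm is the tmpfs that DataLoader workers use to hand tensors back to
# the main process; if it is very small or nearly full, workers can crash
# with a bus error.
total, used, free = shutil.disk_usage('/dev/shm')
print(f'/dev/shm total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB')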

And thanks for your great work. I can already see improvement after some training, and I will probably continue to use mmdetection for tasks with high accuracy requirements.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9

Top GitHub Comments

2 reactions
Pattorio commented, Jul 29, 2019

Never set workers_per_gpu=0, or you will spend a LOOOONG time training.

Did you use Docker? When you run docker, add --shm-size="8g", or any size you need as long as it is more than the 64 MB default.

0 reactions
FishLikeApple commented, Jul 29, 2019

Never set workers_per_gpu=0, or you will spend a LOOOONG time training.

Did you use Docker? When you run docker, add --shm-size="8g", or any size you need as long as it is more than the 64 MB default.

Thanks. I'm using a cloud platform which provides a virtual Linux machine. Now I'm using batch size 1 with GN (it's good to know that mmdetection supports GN). I think that, technically, GN doesn't depend on batch size.
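
To illustrate that last point, GroupNorm computes its statistics per sample over channel groups, so a given image produces the same output whether it is normalized alone or inside a larger batch. A quick sketch with plain PyTorch (toy shapes, unrelated to the actual detector):

import torch
import torch.nn as nn

# GroupNorm normalizes over groups of channels within each sample, unlike
# BatchNorm, which mixes statistics across the batch in training mode.
gn = nn.GroupNorm(num_groups=8, num_channels=64)

x = torch.randn(4, 64, 32, 32)            # a batch of 4 feature maps
single = gn(x[:1])                        # the first image, batch size 1
batched = gn(x)[:1]                       # the same image inside the batch
print(torch.allclose(single, batched, atol=1e-6))  # True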

Read more comments on GitHub

Top Results From Across the Web

Unexpected bus error encountered in worker. This might be ...
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). #283.

Training crashes due to - Insufficient shared memory (shm)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). Here are my system's shared ...

increase pytorch shared memory | Data Science and ... - Kaggle
Error from kernel log ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). I also tried...

PyTorch Dataset leaking memory with basic I/O operation
RuntimeError: DataLoader worker (pid 10666) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory.

ryanfb/kraken - Docker Image
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). This is because Docker defaults to 64MB...
