Bus error. Insufficient shared memory?
Describe the bug
When workers_per_gpu > 0 in the config file, the error below arises. I searched for some keywords from the error message; the only fix I could find was setting workers_per_gpu to 0, which I know makes data preparation and training serial and therefore time-consuming.
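For context, the setting in question sits in the data section of the mmdetection config; the excerpt below is only illustrative (not my actual dcn_best.py), with the dataset entries omitted:
# Illustrative excerpt of an mmdetection-style config (not the actual dcn_best.py).
data = dict(
    imgs_per_gpu=2,      # images processed per GPU in each iteration
    workers_per_gpu=2,   # DataLoader worker processes per GPU; 0 loads data in the main process
)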
Reproduction
- What command or script did you run?
"""a piece of code, dcn_best.py is a config file"""
config_file = 'dcn_best.py'
output_path = "/cos_person/275/1745/object_detection/output/"
#train
os.system('python '+mmdetection_path+'tools/train.py '+main_path+'code/'+config_file+' --gpus 1 --work_dir '+output_path)
- Did you make any modifications on the code or config? Did you understand what you have modified? I changed some parts to suit another dataset. I commented out SyncBN (#norm_cfg=norm_cfg) because it is designed for distributed training and I only have access to one GPU (P40); please let me know if I can use SyncBN on a single GPU for a better result. (A sketch of the normalization settings is given after this list.)
- What dataset did you use? A small part of Open Images.
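For reference, this is roughly what the normalization change amounts to, written in mmdetection's norm_cfg convention; the exact values in dcn_best.py may differ, and the GN variant is only the single-GPU alternative discussed below:
# Sketch of the normalization options involved (mmdetection norm_cfg convention;
# exact values in dcn_best.py may differ).

# Multi-GPU distributed training: synchronized batch norm.
# norm_cfg = dict(type='SyncBN', requires_grad=True)

# Single-GPU alternative: group normalization, which does not depend on batch size.
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)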
Environment
- OS: [linux-x86_64]
- PyTorch version [3.5]
- How you installed PyTorch [pip]
- GPU model [P40] (the environment is a cloud machine that wasn't set up by me; I will try to find more accurate information if you need it)
Error traceback
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 511, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3958) is killed by signal: Bus error.
Bug fix
The machine has 15 GB of memory and 2 CPUs. Is the shared memory insufficient? I'm not sure.
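One way to check whether shared memory really is the limit (my assumption, not something taken from the traceback) is to look at the size of /dev/shm, the shared-memory filesystem that PyTorch DataLoader workers use when passing batches between processes:
import shutil

# /dev/shm is the shared-memory filesystem the DataLoader workers rely on;
# a bus error in a worker usually means it has filled up.
total, used, free = shutil.disk_usage('/dev/shm')
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")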
And thanks for your great work. I can already see improvement after some training, and I will probably keep using mmdetection for tasks with high accuracy requirements.
Top GitHub Comments
Never set workers_per_gpu=0, or training will take a very long time.
Did you use Docker? When you run Docker, add
--shm-size="8g"
or any size you need, as long as it is more than 64M.

Thanks, I'm using a cloud platform which provides a virtual Linux machine. Now I'm using batch size 1 with GN (it's good to know that mmdetection supports GN). I think GN technically doesn't depend on batch size.
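Since I cannot change the Docker settings on this platform, another workaround I have seen suggested (an assumption on my part, not something confirmed in this thread) is to make PyTorch share tensors through temporary files on disk instead of shared memory; a minimal sketch, to be called before any DataLoader is created:
import torch.multiprocessing as mp

# Route worker-to-main-process tensor sharing through files on disk instead of
# /dev/shm; slower, but it sidesteps the shared-memory limit.
mp.set_sharing_strategy('file_system')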