
RuntimeError: unable to open shared memory object </torch_29919_1396182366> in read-write mode

See original GitHub issue

🐛 Bug

Thanks for the maskrcnn-benchmark project, which is really awesome work! However, I ran into a problem while training on my own instance segmentation dataset, as described below.

To Reproduce

Steps to reproduce the behavior:

  1. I just replaced the original instance segmentation training dataset (e.g. COCO) with my own dataset. My dataset is organized in the same format as COCO, i.e., a JSON file for the annotations. I used "R-101-FPN" for instance segmentation training on a single TITAN X GPU.
  2. To make the configuration match my own dataset, I also modified the dataset configuration in ~/maskrcnn-benchmark/maskrcnn_benchmark/config/paths_catalog.py. The main changes were the paths pointing to my own dataset (see the sketch after this list), and I don't think these changes could cause the training failure.
  3. Of course, ~/maskrcnn-benchmark/maskrcnn_benchmark/config/defaults.py and ~/maskrcnn-benchmark/configs/e2e_mask_rcnn_R_101_FPN_1_x.yaml were employed as the default configuration. I also set _C.SOLVER.IMS_PER_BATCH = 1 and _C.TEST.IMS_PER_BATCH = 1.
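
A minimal sketch of the kind of entry described in step 2. The dataset name, directory names, and annotation paths below are placeholders, and the key names simply mirror the existing COCO entries in paths_catalog.py:

# Hypothetical COCO-format dataset entry in
# maskrcnn_benchmark/config/paths_catalog.py (names and paths are placeholders).
class DatasetCatalog(object):
    DATA_DIR = "datasets"
    DATASETS = {
        "my_instances_train": {
            "img_dir": "my_dataset/train_images",
            "ann_file": "my_dataset/annotations/instances_train.json",
        },
        "my_instances_val": {
            "img_dir": "my_dataset/val_images",
            "ann_file": "my_dataset/annotations/instances_val.json",
        },
    }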

Everything was OK at the beginning of training. However, after several thousand iterations, the training broke down. For simplicity, I paste the final training output here:

2018-11-03 11:11:27,514 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:50 iter: 6840 loss: 0.8033 (1.0337) loss_classifier: 0.2025 (0.2768) loss_box_reg: 0.1245 (0.1395) loss_mask: 0.3138 (0.4098) loss_objectness: 0.0600 (0.1297) loss_rpn_box_reg: 0.0195 (0.0779) time: 0.3105 (0.3053) data: 0.0067 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:33,930 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:57 iter: 6860 loss: 0.9135 (1.0337) loss_classifier: 0.1846 (0.2767) loss_box_reg: 0.0630 (0.1395) loss_mask: 0.3499 (0.4097) loss_objectness: 0.0861 (0.1298) loss_rpn_box_reg: 0.0168 (0.0780) time: 0.2981 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:40,246 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:52:00 iter: 6880 loss: 0.7548 (1.0331) loss_classifier: 0.1516 (0.2764) loss_box_reg: 0.0880 (0.1395) loss_mask: 0.3342 (0.4095) loss_objectness: 0.0588 (0.1298) loss_rpn_box_reg: 0.0457 (0.0780) time: 0.3046 (0.3054) data: 0.0064 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:46,088 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:43 iter: 6900 loss: 0.5536 (1.0324) loss_classifier: 0.1185 (0.2762) loss_box_reg: 0.0669 (0.1394) loss_mask: 0.2970 (0.4092) loss_objectness: 0.0445 (0.1297) loss_rpn_box_reg: 0.0095 (0.0779) time: 0.2823 (0.3054) data: 0.0048 (0.0130) lr: 0.002500 max mem: 4887
2018-11-03 11:11:52,392 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:51:45 iter: 6920 loss: 0.7813 (1.0319) loss_classifier: 0.1759 (0.2761) loss_box_reg: 0.0824 (0.1394) loss_mask: 0.3130 (0.4090) loss_objectness: 0.0393 (0.1295) loss_rpn_box_reg: 0.0133 (0.0779) time: 0.3052 (0.3054) data: 0.0061 (0.0129) lr: 0.002500 max mem: 4887
Traceback (most recent call last):
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_29919_1396182366> in read-write mode
Traceback (most recent call last):
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 179, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/ly/sfw/anaconda3/lib/python3.7/socket.py", line 463, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "tools/train_net.py", line 172, in <module>
    main()
  File "tools/train_net.py", line 165, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 74, in train
    arguments,
  File "/home/ly/projects/MaskRCNN/maskrcnn/maskrcnn_benchmark/engine/trainer.py", line 56, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/ly/sfw/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/ly/sfw/anaconda3/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError

Environment

  • PyTorch Version (e.g., 1.0): 1.0
  • OS (e.g., Linux): 16.04
  • How you installed PyTorch (conda, pip, source): conda install pytorch-nightly -c pytorch
  • Build command you used (if compiling from source):
  • Python version: python 3.7
  • CUDA/cuDNN version: cuda 9.0 / cuDNN 7.1.2
  • GPU models and configuration: torch.cuda.set_device(3)
  • Any other relevant information:

What should I do to solve this problem? Thanks for your help!

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

60 reactions
u2400 commented, Feb 25, 2021

I have the same issue; the following code solves it:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

This code is from #11201
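
For anyone wondering where to place the call: putting set_sharing_strategy at the very top of the training entry point, before any DataLoader workers are spawned, is the usual pattern. A minimal self-contained sketch (the toy dataset and loader below are only illustrative, not from this issue):

import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# Share tensors between worker processes through the file system instead of
# file descriptors, so workers no longer exhaust the per-process fd limit
# (the "Too many open files" error in the traceback above).
torch.multiprocessing.set_sharing_strategy('file_system')

if __name__ == '__main__':
    dataset = TensorDataset(torch.randn(64, 3, 32, 32))  # stand-in for the real dataset
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for (batch,) in loader:
        pass  # the training step would go here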

21 reactions
IenLong commented, Nov 4, 2018

The problem seems to be caused by the num_workers parameter of torch.utils.data.DataLoader(…), which has been discussed intensively at https://github.com/pytorch/pytorch/issues/1355. In my investigation, setting _C.DATALOADER.NUM_WORKERS > 0 may lead to the errors mentioned above. Therefore, I set _C.DATALOADER.NUM_WORKERS = 0, and the training has kept running for tens of thousands of iterations without anything unusual happening. However, fewer workers means more training time is needed.
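
If you prefer not to edit defaults.py, the same override can be applied to the config object at runtime. A hedged sketch, assuming an unmodified maskrcnn-benchmark checkout (cfg and merge_from_list come from the yacs-based config the repository uses):

from maskrcnn_benchmark.config import cfg

# Override the data-loading and batch-size settings without editing defaults.py.
cfg.merge_from_list([
    "DATALOADER.NUM_WORKERS", 0,  # single-process loading: no shared-memory worker handles
    "SOLVER.IMS_PER_BATCH", 1,
    "TEST.IMS_PER_BATCH", 1,
])
cfg.freeze()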

