Question about SyncBN
Checklist
- I have searched related issues but cannot get the expected help. #662 #682
- The bug has not been fixed in the latest version.
Describe the bug
I have changed GN to SyncBN as you said in #682, but training gets stuck when `num_shared_convs > 0`, with no error message.
Reproduction
- I have modified `configs/gn/mask_rcnn_r50_fpn_gn_2x.py` as you said in #682 (see the SyncBN conversion sketch after this list). However, the problem did not appear when I set `num_shared_convs=0`:
```python
norm_cfg = dict(type='SyncBN', requires_grad=True)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
...
    bbox_head=dict(
        type='ConvFCBBoxHead',
        num_shared_convs=4,
        num_shared_fcs=1,
        in_channels=256,
        conv_out_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=2,
        target_means=[0., 0., 0., 0.],
        target_stds=[0.1, 0.1, 0.2, 0.2],
        reg_class_agnostic=False,
        norm_cfg=norm_cfg,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
```
- What dataset did you use? ICDAR 2017, a text detection dataset (I have converted it to COCO format).
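As a side note, the same GN-to-SyncBN swap can also be done at the module level in plain PyTorch-based models (available since PyTorch 1.1.0) via `torch.nn.SyncBatchNorm.convert_sync_batchnorm`. A minimal sketch, with a toy conv/BN stack standing in for the shared convs (the layer sizes are just placeholders):

```python
import torch.nn as nn

# Toy stand-in for the shared conv stack in the bbox head.
head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
)

# Recursively replace every BatchNorm*d layer with SyncBatchNorm.
# torch.distributed must be initialized before the converted model
# is wrapped in DistributedDataParallel and run in training mode.
head = nn.SyncBatchNorm.convert_sync_batchnorm(head)
```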
Environment
- OS: Ubuntu 16.04.6
- GCC: 5.4.0
- PyTorch version: 1.1.0
- How you installed PyTorch: pip
- GPU model: V100
- CUDA/cuDNN version: CUDA 9.0, cuDNN 7.0
Error traceback
There is no error traceback; the program simply hangs when `num_shared_convs > 0`.
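When a distributed job hangs with no traceback, a stack dump from the stuck workers usually shows which ranks are blocked inside SyncBN's collective. A minimal sketch using the standard library's `faulthandler` (the signal choice is arbitrary; external tools such as py-spy also work):

```python
import faulthandler
import signal

# Register early in the training script. Sending SIGUSR1 to a hung
# worker (kill -USR1 <pid>) dumps the Python stack of every thread
# to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1)
```

If the dump shows some ranks inside the `SyncBatchNorm` forward and others elsewhere, the ranks have diverged and the collective can never complete.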
Top GitHub Comments
Hi @hellock, I also found a potential solution, but it did not work after I updated the NVIDIA driver from 390.77 to 396.26. I also forgot to mention that I had used `OHEMSampler` for RCNN. After switching from `OHEMSampler` to `RandomSampler`, the problem disappears. So setting `num_shared_convs > 0` in bbox_head and using `OHEMSampler` in RCNN at the same time results in a deadlock with `SyncBN`. The `SyncBN` config is the one shown above. Maybe the problem comes from the official SyncBN implementation. Thanks for your reply anyway!
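For anyone hitting the same hang: every `SyncBatchNorm` forward performs an all-reduce of batch statistics, so all GPUs must execute the same number of SyncBN forwards. The sketch below reproduces that generic failure mode under an assumed 2-GPU `torchrun` launch; it is not necessarily the exact path OHEM takes, just the class of rank-dependent control flow that deadlocks SyncBN (it hangs by design):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Illustrative only: this script hangs by design.
# Assumed launch: torchrun --nproc_per_node=2 syncbn_hang.py
dist.init_process_group(backend='nccl')
rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(rank)

bn = nn.SyncBatchNorm(8).cuda()          # training mode by default
x = torch.randn(4, 8, 7, 7, device='cuda')

# Rank-dependent control flow: rank 0 runs two forwards, rank 1 one.
# Each SyncBatchNorm forward all-reduces batch statistics across all
# ranks, so rank 0's second call blocks forever on a peer that never
# makes the matching call.
steps = 2 if rank == 0 else 1
for _ in range(steps):
    bn(x)
```

Any configuration that guarantees identical control flow on every rank (e.g. `RandomSampler`, as noted above) avoids the hang.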
Feel free to reopen it.