
Question about SyncBN


Checklist

  1. I have searched related issues (#662, #682) but could not get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I changed GN to SyncBN as suggested in #682, but training gets stuck when num_shared_convs > 0, with no error message.

Reproduction

  1. I have modified configs/gn/mask_rcnn_r50_fpn_gn_2x.py as suggested in #682; the relevant parts are shown below. However, the problem did not appear when I set num_shared_convs=0 (a short sketch of what this norm_cfg builds follows after this list).
norm_cfg = dict(type='SyncBN', requires_grad=True)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
...
bbox_head=dict(
    type='ConvFCBBoxHead',
    num_shared_convs=4,
    num_shared_fcs=1,
    in_channels=256,
    conv_out_channels=256,
    fc_out_channels=1024,
    roi_feat_size=7,
    num_classes=2,
    target_means=[0., 0., 0., 0.],
    target_stds=[0.1, 0.1, 0.2, 0.2],
    reg_class_agnostic=False,
    norm_cfg=norm_cfg,
    loss_cls=dict(
        type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
    loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
  2. What dataset did you use? ICDAR 2017, a text detection dataset (which I converted to COCO format).
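
For context, a minimal sketch of what this norm_cfg resolves to (an aside, not part of the original report; it assumes a recent mmcv that exposes build_norm_layer from mmcv.cnn, whereas 2019-era mmdetection kept this helper internally): each shared conv in ConvFCBBoxHead is built as a ConvModule containing a SyncBatchNorm layer, which is why num_shared_convs=0 keeps SyncBN out of the bbox head entirely.

# Sketch only: inspect the layer that norm_cfg=dict(type='SyncBN', ...) builds.
# Assumes build_norm_layer is importable from mmcv.cnn (recent mmcv releases).
from mmcv.cnn import build_norm_layer

norm_cfg = dict(type='SyncBN', requires_grad=True)

# build_norm_layer returns a (name, module) pair; for type='SyncBN' the module
# is torch.nn.SyncBatchNorm, whose training-mode forward issues a cross-GPU
# collective to average batch statistics over all ranks.
name, layer = build_norm_layer(norm_cfg, 256)
print(name, layer)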

Environment

  • OS: Ubuntu 16.04.6
  • GCC: 5.4.0
  • PyTorch version: 1.1.0
  • How you installed PyTorch: pip
  • GPU model: V100
  • CUDA/CUDNN version: 9.0 / 7.0

Error traceback

There is no error traceback; the program simply hangs when num_shared_convs > 0.
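
When a run hangs with no traceback like this, one option (an addition here, not something from the original report) is Python's built-in faulthandler: register a signal handler near the top of the training script and it will dump every thread's stack on demand, which helps confirm whether each worker is blocked inside a collective op.

# Debugging aid, assumed to be placed near the top of the training script.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` on a stuck worker prints all thread stacks.
faulthandler.register(signal.SIGUSR1, all_threads=True)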

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
jylins commented, Jul 8, 2019

Hi @hellock, I also found a potential solution, but it did not work even after I updated the NVIDIA driver from version 390.77 to 396.26. I also forgot to mention that I had used OHEMSampler for RCNN; after switching from OHEMSampler to RandomSampler, the problem disappears. So setting num_shared_convs > 0 in bbox_head and using OHEMSampler in RCNN at the same time results in a deadlock with SyncBN. The SyncBN config is as follows:

# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)

model = dict(
    type='MaskRCNN',
    pretrained='modelzoo://resnet50',
    backbone=dict(
        ...),
    neck=dict(
        ...),
    rpn_head=dict(
        ...),
    bbox_roi_extractor=dict(
        ...),
    bbox_head=dict(
        type='ConvFCBBoxHead',
        num_shared_convs=4,
        num_shared_fcs=1,
        in_channels=256,
        conv_out_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=2,
        target_means=[0., 0., 0., 0.],
        target_stds=[0.1, 0.1, 0.2, 0.2],
        reg_class_agnostic=False,
        norm_cfg=norm_cfg,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
    mask_roi_extractor=dict(
        ...),
    mask_head=dict(
        ...))
# model training and testing settings
train_cfg = dict(
    rpn=dict(
        ...),
    rpn_proposal=dict(
        ...),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            ignore_iof_thr=0.5),
        sampler=dict(
            type='OHEMSampler',  # After using 'RandomSampler', the problem disappears.
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        mask_size=28,
        pos_weight=-1,
        debug=False))
test_cfg = dict(
    rpn=dict(
        ...),
    rcnn=dict(
        ...))
# dataset settings
...
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
...

Maybe the problem results from the official SyncBN implementation. Thanks for your reply anyway!
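
For reference, a minimal sketch (not from this thread) of the official PyTorch SyncBN implementation referred to above: torch.nn.SyncBatchNorm replaces ordinary BatchNorm layers and reduces batch statistics across all ranks on every training-mode forward pass, so every rank has to reach each SyncBN forward together or the underlying collective blocks, which is consistent with the hang described in this issue.

# Sketch of PyTorch's built-in SyncBN; plain torch, not mmdetection code.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 256, 3, padding=1),
    nn.BatchNorm2d(256),   # converted to SyncBatchNorm below
    nn.ReLU(inplace=True),
)

# Replace every BatchNorm*d with SyncBatchNorm; statistics are averaged over
# the (default) process group during each training-mode forward pass.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)

# In a real multi-GPU run the converted model would then be wrapped, e.g.:
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])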

0 reactions
yhcao6 commented, Apr 20, 2020

Feel free to reopen it.

