
Question about SyncBN


Checklist

  1. I have searched related issues (#662, #682) but could not get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I changed GN to SyncBN as suggested in #682, but training gets stuck when num_shared_convs > 0, with no error message.

Reproduction

  1. I have modified configs/gn/mask_rcnn_r50_fpn_gn_2x.py as suggested in #682; the relevant parts are shown below. However, the problem did not appear when I set num_shared_convs=0 (a short sketch of what this norm_cfg builds follows after this list).
norm_cfg = dict(type='SyncBN', requires_grad=True)
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
...
bbox_head=dict(
    type='ConvFCBBoxHead',
    num_shared_convs=4,
    num_shared_fcs=1,
    in_channels=256,
    conv_out_channels=256,
    fc_out_channels=1024,
    roi_feat_size=7,
    num_classes=2,
    target_means=[0., 0., 0., 0.],
    target_stds=[0.1, 0.1, 0.2, 0.2],
    reg_class_agnostic=False,
    norm_cfg=norm_cfg,
    loss_cls=dict(
        type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
    loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
  2. What dataset did you use? ICDAR 2017, a text detection dataset (which I converted to COCO format).
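
For context, a minimal sketch of what this norm_cfg resolves to (an aside, not part of the original report; it assumes a recent mmcv that exposes build_norm_layer from mmcv.cnn, whereas 2019-era mmdetection kept this helper internally): each shared conv in ConvFCBBoxHead is built as a ConvModule containing a SyncBatchNorm layer, which is why num_shared_convs=0 keeps SyncBN out of the bbox head entirely.

# Sketch only: inspect the layer that norm_cfg=dict(type='SyncBN', ...) builds.
# Assumes build_norm_layer is importable from mmcv.cnn (recent mmcv releases).
from mmcv.cnn import build_norm_layer

norm_cfg = dict(type='SyncBN', requires_grad=True)

# build_norm_layer returns a (name, module) pair; for type='SyncBN' the module
# is torch.nn.SyncBatchNorm, whose training-mode forward issues a cross-GPU
# collective to average batch statistics over all ranks.
name, layer = build_norm_layer(norm_cfg, 256)
print(name, layer)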

Environment

  • OS: Ubuntu 16.04.6
  • GCC: 5.4.0
  • PyTorch version: 1.1.0
  • How you installed PyTorch: pip
  • GPU model: V100
  • CUDA/CUDNN version: 9.0 / 7.0

Error traceback

There is no error traceback; the program simply hangs when num_shared_convs > 0.
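
When a run hangs with no traceback like this, one option (an addition here, not something from the original report) is Python's built-in faulthandler: register a signal handler near the top of the training script and it will dump every thread's stack on demand, which helps confirm whether each worker is blocked inside a collective op.

# Debugging aid, assumed to be placed near the top of the training script.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` on a stuck worker prints all thread stacks.
faulthandler.register(signal.SIGUSR1, all_threads=True)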

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
jylins commented, Jul 8, 2019

Hi @hellock, I also found a potential solution, but it did not work even after I updated the NVIDIA driver from version 390.77 to 396.26. I also forgot to mention that I had used OHEMSampler for RCNN; after switching from OHEMSampler to RandomSampler, the problem disappears. So setting num_shared_convs > 0 in bbox_head and using OHEMSampler in RCNN at the same time results in a deadlock with SyncBN. The SyncBN config is as follows:

# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)

model = dict(
    type='MaskRCNN',
    pretrained='modelzoo://resnet50',
    backbone=dict(
        ...),
    neck=dict(
        ...),
    rpn_head=dict(
        ...),
    bbox_roi_extractor=dict(
        ...),
    bbox_head=dict(
        type='ConvFCBBoxHead',
        num_shared_convs=4,
        num_shared_fcs=1,
        in_channels=256,
        conv_out_channels=256,
        fc_out_channels=1024,
        roi_feat_size=7,
        num_classes=2,
        target_means=[0., 0., 0., 0.],
        target_stds=[0.1, 0.1, 0.2, 0.2],
        reg_class_agnostic=False,
        norm_cfg=norm_cfg,
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
        loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0)),
    mask_roi_extractor=dict(
        ...),
    mask_head=dict(
        ...))
# model training and testing settings
train_cfg = dict(
    rpn=dict(
        ...),
    rpn_proposal=dict(
        ...),
    rcnn=dict(
        assigner=dict(
            type='MaxIoUAssigner',
            pos_iou_thr=0.5,
            neg_iou_thr=0.5,
            min_pos_iou=0.5,
            ignore_iof_thr=0.5),
        sampler=dict(
            type='OHEMSampler',  # After using 'RandomSampler', the problem disappears.
            num=512,
            pos_fraction=0.25,
            neg_pos_ub=-1,
            add_gt_as_proposals=True),
        mask_size=28,
        pos_weight=-1,
        debug=False))
test_cfg = dict(
    rpn=dict(
        ...),
    rcnn=dict(
        ...))
# dataset settings
...
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
...

Maybe the problem results from the official SyncBN implementation. Thanks for your reply anyway!
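
For reference, a minimal sketch (not from this thread) of the official PyTorch SyncBN implementation referred to above: torch.nn.SyncBatchNorm replaces ordinary BatchNorm layers and reduces batch statistics across all ranks on every training-mode forward pass, so every rank has to reach each SyncBN forward together or the underlying collective blocks, which is consistent with the hang described in this issue.

# Sketch of PyTorch's built-in SyncBN; plain torch, not mmdetection code.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 256, 3, padding=1),
    nn.BatchNorm2d(256),   # converted to SyncBatchNorm below
    nn.ReLU(inplace=True),
)

# Replace every BatchNorm*d with SyncBatchNorm; statistics are averaged over
# the (default) process group during each training-mode forward pass.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)

# In a real multi-GPU run the converted model would then be wrapped, e.g.:
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])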

0 reactions
yhcao6 commented, Apr 20, 2020

Feel free to reopen it.

