Replacing FrozenBatchNorm with SyncBatchNorm?
🚀 Feature
Recently, pytorch-nightly gained a new feature: SyncBatchNorm.
I have tried to replace all FrozenBatchNorm layers in maskrcnn_benchmark/modeling/backbone/resnet.py with the new SyncBN, but the program crashes after several iterations. Here is the log:
2019-03-13 14:51:26,344 maskrcnn_benchmark.trainer INFO: Start training
2019-03-13 14:51:46,113 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:42:08 iter: 20 loss: 24.5855 (nan) loss_box_reg: 0.0073 (nan) loss_classifier: 13.8356 (nan) loss_mask: 8.8279 (nan) loss_objectness: 0.6801 (14225872939778.2500) loss_rpn_box_reg: 0.1516 (3601105699730.5649) time: 0.7603 (0.9883) data: 0.0348 (0.0980) lr: 0.007173 max mem: 6825
2019-03-13 14:52:01,234 maskrcnn_benchmark.trainer INFO: eta: 21:47:42 iter: 40 loss: nan (nan) loss_box_reg: nan (nan) loss_classifier: nan (nan) loss_mask: nan (nan) loss_objectness: 0.5964 (7112936469889.4238) loss_rpn_box_reg: 0.1185 (1800552849865.3418) time: 0.7537 (0.8722) data: 0.0345 (0.0664) lr: 0.007707 max mem: 6825
As you can see, the loss becomes NaN after several iterations.
I have also tried the normal nn.BatchNorm2d and enlarged the batch size, but that didn't solve the problem either.
So is it possible to use SyncBatchNorm here, in order to train with a larger batch size?
By the way, I'm using 4 GPUs and I didn't change anything else.
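For concreteness, here is a minimal sketch of the kind of swap being discussed. The backbone below is a toy stand-in (the real resnet.py hard-codes FrozenBatchNorm2d, so the actual change is an edit to that file), but the conversion call is the standard torch.nn.SyncBatchNorm API:

```python
import torch
import torch.nn as nn

# Toy stand-in for the backbone built in
# maskrcnn_benchmark/modeling/backbone/resnet.py; purely illustrative.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),   # stands in for FrozenBatchNorm2d upstream
    nn.ReLU(inplace=True),
)

# Replace every BatchNorm layer with SyncBatchNorm so statistics are
# computed across all GPUs in the process group.  Training with it
# requires torch.distributed to be initialized (e.g. via
# torch.distributed.launch), which maskrcnn_benchmark already does.
backbone = nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
```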
Issue Analytics
- Created 5 years ago
- Comments: 22 (8 by maintainers)
Top Results From Across the Web
detectron2.layers
Convert all BatchNorm/SyncBatchNorm in module into FrozenBatchNorm. ... SyncBatchNorm has incorrect gradient when the batch size on each worker is different ...
How to change SyncBatchNorm - PyTorch Forums
i want to try the model on windows which is not supported to the distrubution. And i change the net = torch.nn.
Python torch.nn.SyncBatchNorm() Examples
This page shows Python examples of torch.nn.SyncBatchNorm.
Rethinking "Batch" in BatchNorm - arXiv Vanity
To study the behavior of BatchNorm, we replace the default 2fc box head ... to use frozen population statistics, also known as Frozen...
CVPR/regionclip-demo at main - Hugging Face
SyncBatchNorm (planes * self.expansion) self.downsample = nn. ... and convert all BatchNorm layers to FrozenBatchNorm Returns: the block itself """ for p in ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
BTW, the .pth is from https://pytorch.org/docs/stable/torchvision/models.html, which was trained after normalizing the input to [0, 1] instead of [0, 255]. Thus, the config might need to be modified accordingly.
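As a sketch of the kind of config change hinted at here, assuming the INPUT.* keys from maskrcnn_benchmark's config/defaults.py (the exact names and value handling should be verified against your checkout):

```python
from maskrcnn_benchmark.config import cfg

# torchvision-pretrained models expect RGB inputs scaled to [0, 1] and
# normalized with the ImageNet statistics below, whereas the Caffe2-derived
# R-50.pkl expects BGR inputs in [0, 255] with mean subtraction only.
cfg.INPUT.TO_BGR255 = False
cfg.INPUT.PIXEL_MEAN = [0.485, 0.456, 0.406]
cfg.INPUT.PIXEL_STD = [0.229, 0.224, 0.225]
```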
@kjgfcdb
The crashing problem might be caused by wrong weight initialization, i.e. loading the weights from R-50.pkl. The moving mean and variance have been merged into the scale and bias in the R-50.pkl weights. When using FrozenBatchNorm this is fine, since its moving mean and variance are 0 and 1. But SyncBatchNorm or BatchNorm would recalculate the moving mean and variance on each training batch, which causes the problem. The solution might be straightforward: use https://download.pytorch.org/models/resnet50-19c8e357.pth for pretraining instead of R-50.pkl.
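To make the difference concrete, here is a simplified sketch (not the repo's exact class) of why statistics folded into the weights are safe with FrozenBatchNorm but not with BatchNorm/SyncBatchNorm, which re-estimate them from every training batch:

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """Simplified FrozenBatchNorm: a fixed per-channel affine transform
    y = x * weight + bias, with no dependence on batch statistics.
    (The real class in the repo also keeps running_mean/var buffers.)"""

    def __init__(self, num_channels):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_channels))
        self.register_buffer("bias", torch.zeros(num_channels))

    def forward(self, x):
        # Broadcast over (N, C, H, W); the folded statistics in weight/bias
        # are applied unchanged, regardless of the current batch.
        w = self.weight.reshape(1, -1, 1, 1)
        b = self.bias.reshape(1, -1, 1, 1)
        return x * w + b

x = torch.randn(2, 3, 8, 8)
frozen = FrozenBatchNorm2d(3)      # applies the stored affine transform only
regular = nn.BatchNorm2d(3)        # recomputes mean/var from this batch

# With folded weights, the frozen layer reproduces the pretrained behavior;
# a regular BatchNorm would normalize again on top of it, which is why the
# R-50.pkl weights break once the layers are unfrozen.
print(frozen(x).mean().item(), regular(x).mean().item())
```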