
Replacing FrozenBatchNorm with SyncBatchNorm?


🚀 Feature

PyTorch nightly recently gained a new feature: SyncBatchNorm.

I have tried to replace all FrozenBatchNorm layers in maskrcnn_benchmark/modeling/backbone/resnet.py with the new SyncBN, but I find that the program crashes after several iterations. Here is the log file:

2019-03-13 14:51:26,344 maskrcnn_benchmark.trainer INFO: Start training
2019-03-13 14:51:46,113 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:42:08  iter: 20  loss: 24.5855 (nan)  loss_box_reg: 0.0073 (nan)  loss_classifier: 13.8356 (nan)  loss_mask: 8.8279 (nan)  loss_objectness: 0.6801 (14225872939778.2500)  loss_rpn_box_reg: 0.1516 (3601105699730.5649)  time: 0.7603 (0.9883)  data: 0.0348 (0.0980)  lr: 0.007173  max mem: 6825
2019-03-13 14:52:01,234 maskrcnn_benchmark.trainer INFO: eta: 21:47:42  iter: 40  loss: nan (nan)  loss_box_reg: nan (nan)  loss_classifier: nan (nan)  loss_mask: nan (nan)  loss_objectness: 0.5964 (7112936469889.4238)  loss_rpn_box_reg: 0.1185 (1800552849865.3418)  time: 0.7537 (0.8722)  data: 0.0345 (0.0664)  lr: 0.007707  max mem: 6825

As you can see, the loss becomes NaN after several iterations.

I have also tried using the normal nn.BatchNorm2d and enlarging the batch size, but neither solved the problem.

So is it possible to use SyncBatchNorm here, in order to get a larger batch size?

By the way, I'm using 4 GPUs and I didn't change anything else.
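For reference, the replacement itself does not need manual edits to resnet.py: torch.nn.SyncBatchNorm.convert_sync_batchnorm walks a module tree and swaps every BatchNorm*d layer in place. A minimal sketch (the toy backbone below is a stand-in for maskrcnn_benchmark's ResNet, not its real structure):

```python
import torch.nn as nn

# Toy stand-in for a backbone; any nn.Module tree works.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Recursively convert every BatchNorm*d layer to SyncBatchNorm.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
assert isinstance(sync_model[1], nn.SyncBatchNorm)
```

Note that SyncBatchNorm only synchronizes statistics at training time when a torch.distributed process group is initialized (e.g. under DistributedDataParallel); the conversion itself is safe anywhere.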

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:22 (8 by maintainers)

Top GitHub Comments

zhangliliang commented on May 12, 2019 (1 reaction)

BTW, the .pth is from https://pytorch.org/docs/stable/torchvision/models.html, which was trained with the input normalized to [0, 1] instead of [0, 255]. Thus, this config might need to be modified:

# Values to be used for image normalization
_C.INPUT.PIXEL_MEAN = [0.485, 0.456, 0.406]
# Values to be used for image normalization
_C.INPUT.PIXEL_STD = [0.229, 0.224, 0.225]
# Convert image to BGR format (for Caffe2 models), in range 0-255
_C.INPUT.TO_BGR255 = False
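The two conventions can be contrasted with a quick sketch. The torchvision values are the ones in the config above; the Caffe2-style means are the BGR, 0-255 defaults that maskrcnn_benchmark ships for Caffe2 weights (quoted here as an assumption; check the repo's defaults.py):

```python
# torchvision convention: RGB channels, input scaled to [0, 1].
torchvision_mean = [0.485, 0.456, 0.406]
torchvision_std = [0.229, 0.224, 0.225]

# Caffe2 convention (assumed defaults): BGR channels, input in [0, 255],
# std of 1.0 per channel.
caffe2_mean = [102.9801, 115.9465, 122.7717]

def normalize(pixel, mean, std):
    """Per-channel (x - mean) / std normalization."""
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]

# A mid-gray RGB pixel under the torchvision convention:
out = normalize([0.5, 0.5, 0.5], torchvision_mean, torchvision_std)
```

Feeding 0-255 inputs into a network pretrained under the [0, 1] convention (or vice versa) shifts every activation by orders of magnitude, which is why the config needs to change along with the pretrained weights.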
zhangliliang commented on May 12, 2019 (1 reaction)

@kjgfcdb

The crashing problem might be caused by wrong weight initialization, i.e. loading the weights from R-50.pkl. In R-50.pkl, the moving mean and variance have been merged into the scale and bias of each BN layer's weights. When using FrozenBatchNorm this is fine, since its moving mean and variance are fixed at 0 and 1. But SyncBatchNorm or BatchNorm recalculate the moving mean and variance on each training batch, so the already-folded weights cause problems.

The solution might be straightforward: use https://download.pytorch.org/models/resnet50-19c8e357.pth for pretraining instead of R-50.pkl.
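The folding described above can be checked numerically. A minimal sketch with made-up BN parameters (not values from R-50.pkl):

```python
import math

# Hypothetical BatchNorm parameters for one channel.
gamma, beta = 1.5, 0.2            # learned affine weight / bias
mean, var, eps = 0.7, 4.0, 1e-5   # running statistics

# Caffe2-style folding: bake the running stats into the affine terms.
inv_std = 1.0 / math.sqrt(var + eps)
scale = gamma * inv_std
bias = beta - gamma * mean * inv_std

x = 2.0
# Plain BatchNorm in eval mode:
bn_out = gamma * (x - mean) * inv_std + beta
# FrozenBatchNorm loaded with the folded weights: its running mean/var are
# 0 and 1, so it reduces to the same affine transform.
frozen_out = scale * x + bias
assert abs(bn_out - frozen_out) < 1e-4

# A live (Sync)BatchNorm loaded with the same folded scale/bias would
# re-estimate mean/var from each batch and normalize a second time on top
# of the folded statistics, which is consistent with the NaN losses above.
```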


Top Results From Across the Web

  • detectron2.layers — Convert all BatchNorm/SyncBatchNorm in module into FrozenBatchNorm. ... SyncBatchNorm has incorrect gradient when the batch size on each worker is different ...
  • How to change SyncBatchNorm (PyTorch Forums) — i want to try the model on windows which is not supported to the distrubution. And i change the net = torch.nn. ...
  • Python torch.nn.SyncBatchNorm() Examples — This page shows Python examples of torch.nn.SyncBatchNorm.
  • Rethinking "Batch" in BatchNorm (arXiv Vanity) — To study the behavior of BatchNorm, we replace the default 2fc box head ... to use frozen population statistics, also known as Frozen...
  • CVPR/regionclip-demo at main (Hugging Face) — SyncBatchNorm (planes * self.expansion) self.downsample = nn. ... and convert all BatchNorm layers to FrozenBatchNorm ...
