Replacing FrozenBatchNorm with SyncBatchNorm?
🚀 Feature
Recently, pytorch-nightly gained a new feature: SyncBatchNorm.
I have tried to replace all FrozenBatchNorm layers in maskrcnn_benchmark/modeling/backbone/resnet.py with the new SyncBN, but the program crashes after several iterations. Here is the log:
2019-03-13 14:51:26,344 maskrcnn_benchmark.trainer INFO: Start training
2019-03-13 14:51:46,113 maskrcnn_benchmark.trainer INFO: eta: 1 day, 0:42:08 iter: 20 loss: 24.5855 (nan) loss_box_reg: 0.0073 (nan) loss_classifier: 13.8356 (nan) loss_mask: 8.8279 (nan) loss_objectness: 0.6801 (14225872939778.2500) loss_rpn_box_reg: 0.1516 (3601105699730.5649) time: 0.7603 (0.9883) data: 0.0348 (0.0980) lr: 0.007173 max mem: 6825
2019-03-13 14:52:01,234 maskrcnn_benchmark.trainer INFO: eta: 21:47:42 iter: 40 loss: nan (nan) loss_box_reg: nan (nan) loss_classifier: nan (nan) loss_mask: nan (nan) loss_objectness: 0.5964 (7112936469889.4238) loss_rpn_box_reg: 0.1185 (1800552849865.3418) time: 0.7537 (0.8722) data: 0.0345 (0.0664) lr: 0.007707 max mem: 6825
As you can see, the loss becomes NaN after several iterations.
I have also tried the normal nn.BatchNorm2d and enlarged the batch size, but that didn't solve the problem either.
So is it possible to use SyncBatchNorm here, in order to train with a larger batch size?
By the way, I'm using 4 GPUs and I didn't change anything else.
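For concreteness, here is a minimal sketch of the kind of swap being discussed. The backbone below is a toy stand-in (the real resnet.py hard-codes FrozenBatchNorm2d, so the actual change is an edit to that file), but the conversion call is the standard torch.nn.SyncBatchNorm API:

```python
import torch
import torch.nn as nn

# Toy stand-in for the backbone built in
# maskrcnn_benchmark/modeling/backbone/resnet.py; purely illustrative.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),   # stands in for FrozenBatchNorm2d upstream
    nn.ReLU(inplace=True),
)

# Replace every BatchNorm layer with SyncBatchNorm so statistics are
# computed across all GPUs in the process group.  Training with it
# requires torch.distributed to be initialized (e.g. via
# torch.distributed.launch), which maskrcnn_benchmark already does.
backbone = nn.SyncBatchNorm.convert_sync_batchnorm(backbone)
```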
Issue Analytics
- Created 5 years ago
- Comments: 22 (8 by maintainers)
Top Results From Across the Web
detectron2.layers
Convert all BatchNorm/SyncBatchNorm in module into FrozenBatchNorm. ... SyncBatchNorm has incorrect gradient when the batch size on each worker is different ...
How to change SyncBatchNorm - PyTorch Forums
i want to try the model on windows which is not supported to the distrubution. And i change the net = torch.nn.
Python torch.nn.SyncBatchNorm() Examples
This page shows Python examples of torch.nn.SyncBatchNorm.
Rethinking "Batch" in BatchNorm - arXiv Vanity
To study the behavior of BatchNorm, we replace the default 2fc box head ... to use frozen population statistics, also known as Frozen...
CVPR/regionclip-demo at main - Hugging Face
SyncBatchNorm (planes * self.expansion) self.downsample = nn. ... and convert all BatchNorm layers to FrozenBatchNorm Returns: the block itself """ for p in ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
BTW, the .pth is from https://pytorch.org/docs/stable/torchvision/models.html, which was trained after normalizing the input to [0, 1] instead of [0, 255]. Thus, the config might need to be modified accordingly.
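As a sketch of the kind of config change hinted at here, assuming the INPUT.* keys from maskrcnn_benchmark's config/defaults.py (the exact names and value handling should be verified against your checkout):

```python
from maskrcnn_benchmark.config import cfg

# torchvision-pretrained models expect RGB inputs scaled to [0, 1] and
# normalized with the ImageNet statistics below, whereas the Caffe2-derived
# R-50.pkl expects BGR inputs in [0, 255] with mean subtraction only.
cfg.INPUT.TO_BGR255 = False
cfg.INPUT.PIXEL_MEAN = [0.485, 0.456, 0.406]
cfg.INPUT.PIXEL_STD = [0.229, 0.224, 0.225]
```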
@kjgfcdb
The crashing problem might be caused by wrong weight initialization, i.e. loading the weights from R-50.pkl. The moving mean and variance have been merged into the scale and bias in the R-50.pkl weights. When using FrozenBatchNorm this is fine, since its moving mean and variance are 0 and 1. But SyncBatchNorm or BatchNorm would recalculate the moving mean and variance on each training batch, which causes the problem. The solution might be straightforward: use https://download.pytorch.org/models/resnet50-19c8e357.pth for pretraining instead of R-50.pkl.
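To make the difference concrete, here is a simplified sketch (not the repo's exact class) of why statistics folded into the weights are safe with FrozenBatchNorm but not with BatchNorm/SyncBatchNorm, which re-estimate them from every training batch:

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """Simplified FrozenBatchNorm: a fixed per-channel affine transform
    y = x * weight + bias, with no dependence on batch statistics.
    (The real class in the repo also keeps running_mean/var buffers.)"""

    def __init__(self, num_channels):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_channels))
        self.register_buffer("bias", torch.zeros(num_channels))

    def forward(self, x):
        # Broadcast over (N, C, H, W); the folded statistics in weight/bias
        # are applied unchanged, regardless of the current batch.
        w = self.weight.reshape(1, -1, 1, 1)
        b = self.bias.reshape(1, -1, 1, 1)
        return x * w + b

x = torch.randn(2, 3, 8, 8)
frozen = FrozenBatchNorm2d(3)      # applies the stored affine transform only
regular = nn.BatchNorm2d(3)        # recomputes mean/var from this batch

# With folded weights, the frozen layer reproduces the pretrained behavior;
# a regular BatchNorm would normalize again on top of it, which is why the
# R-50.pkl weights break once the layers are unfrozen.
print(frozen(x).mean().item(), regular(x).mean().item())
```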