Training gets stuck when using SyncBN
`DistributedDataParallel` works great for me. But when I use it together with synchronized batch normalization, either the Python version or the optimized version, the training gets stuck after a few iterations and the code gives the following warning:
/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))
Any idea how I should debug it?
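For reference, here is a minimal sketch of the kind of setup described above: one process per GPU, batch norm layers converted to apex's synchronized version, then the model wrapped in `DistributedDataParallel`. The layer sizes and the use of `convert_syncbn_model` are assumptions, since the report does not include the setup code.

```python
import torch
import torch.distributed as dist
from apex.parallel import convert_syncbn_model  # assumes apex is installed

def build_model(local_rank: int) -> torch.nn.Module:
    # One process per GPU; the default process group is created here.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.BatchNorm2d(16),
        torch.nn.ReLU(),
    ).cuda()

    # Replace every BatchNorm layer with apex's synchronized implementation.
    model = convert_syncbn_model(model)

    # Plain DDP works fine; the hang only shows up once sync BN is enabled.
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```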
I think this issue is related to `process_group = group_creator()` in `optimized_sync_batchnorm_kernel.py`. In `parallel/__init__.py`, you set `group_creator` to `new_group` if `get_default_group` is not available. However, I don't think that's a good idea: `get_default_group` is not available in PyTorch 1.0, so that line creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway, so we may not need that line.

After I removed that line and the `process_group` argument in both `torch.distributed.all_reduce` and `torch.distributed.all_gather`, the training now works, even with `tqdm`.
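For illustration, here is a sketch of the kind of change being described, not the actual apex source; the function name and the reduction of mean/variance statistics are made up for the example. The point is to drop the per-call group creation and let the collectives fall back to the default process group that `init_process_group` already set up.

```python
import torch
import torch.distributed as dist

def sync_batch_stats(mean: torch.Tensor, var_biased: torch.Tensor):
    # Problematic pattern on PyTorch 1.0 (creates a fresh group on every
    # sync BN forward, which eventually hangs the collectives):
    #   process_group = dist.new_group()
    #   dist.all_reduce(mean, group=process_group)

    # Fix described above: no explicit group, so the default group is used.
    dist.all_reduce(mean)
    dist.all_reduce(var_biased)
    world_size = dist.get_world_size()
    mean /= world_size
    var_biased /= world_size
    return mean, var_biased
```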
I met the same issue. But in my case, I also use `torch.distributed.all_reduce()` under a condition in my training loop. After fixing a couple of mistakes like this, the problem disappeared.
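A collective like `all_reduce` only completes when every rank in the group calls it, so guarding it with a condition that holds on some ranks but not others is a classic way to produce this kind of hang. The helper below is a hypothetical illustration of that class of mistake and its fix, not the poster's actual code; the function name, the `step % 10` condition, and the logging are all made up for the example.

```python
import torch
import torch.distributed as dist

def log_mean_loss(loss: torch.Tensor, step: int) -> None:
    # Deadlock-prone: only rank 0 would enter the collective, the other
    # ranks never call it, and rank 0 blocks forever inside all_reduce.
    #   if dist.get_rank() == 0 and step % 10 == 0:
    #       dist.all_reduce(loss)

    # Safe: every rank evaluates the same condition and reaches the
    # collective together.
    if step % 10 == 0:
        reduced = loss.detach().clone()
        dist.all_reduce(reduced)          # sums across all ranks
        reduced /= dist.get_world_size()  # turn the sum into a mean
        if dist.get_rank() == 0:
            print(f"step {step}: mean loss {reduced.item():.4f}")
```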