Training gets stuck when using SyncBN
`DistributedDataParallel` works great for me. But when I use it together with synchronized batch normalization, either the Python version or the optimized version, the training gets stuck after a few iterations and the code gives the following warning:
/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))
Any idea how I should debug it?
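For reference, here is a minimal sketch of the kind of setup described above: one process per GPU, batch norm layers converted to apex's synchronized version, then the model wrapped in `DistributedDataParallel`. The layer sizes and the use of `convert_syncbn_model` are assumptions, since the report does not include the setup code.

```python
import torch
import torch.distributed as dist
from apex.parallel import convert_syncbn_model  # assumes apex is installed

def build_model(local_rank: int) -> torch.nn.Module:
    # One process per GPU; the default process group is created here.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.BatchNorm2d(16),
        torch.nn.ReLU(),
    ).cuda()

    # Replace every BatchNorm layer with apex's synchronized implementation.
    model = convert_syncbn_model(model)

    # Plain DDP works fine; the hang only shows up once sync BN is enabled.
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```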
I think this issue is related to `process_group = group_creator()` in `optimized_sync_batchnorm_kernel.py`. In `parallel/__init__.py`, you set `group_creator` to `new_group` if `get_default_group` is not available. However, I don't think that's a good idea: `get_default_group` is not available in PyTorch 1.0, so that line creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway, so we may not need that line.

After I removed that line and the `process_group` argument in both `torch.distributed.all_reduce` and `torch.distributed.all_gather`, the training now works, even with `tqdm`.
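For illustration, here is a sketch of the kind of change being described, not the actual apex source; the function name and the reduction of mean/variance statistics are made up for the example. The point is to drop the per-call group creation and let the collectives fall back to the default process group that `init_process_group` already set up.

```python
import torch
import torch.distributed as dist

def sync_batch_stats(mean: torch.Tensor, var_biased: torch.Tensor):
    # Problematic pattern on PyTorch 1.0 (creates a fresh group on every
    # sync BN forward, which eventually hangs the collectives):
    #   process_group = dist.new_group()
    #   dist.all_reduce(mean, group=process_group)

    # Fix described above: no explicit group, so the default group is used.
    dist.all_reduce(mean)
    dist.all_reduce(var_biased)
    world_size = dist.get_world_size()
    mean /= world_size
    var_biased /= world_size
    return mean, var_biased
```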
I met the same issue. But in my case, I also use `torch.distributed.all_reduce()` under a condition in my training loop. After fixing a couple of mistakes like this, the problem disappeared.
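A collective like `all_reduce` only completes when every rank in the group calls it, so guarding it with a condition that holds on some ranks but not others is a classic way to produce this kind of hang. The helper below is a hypothetical illustration of that class of mistake and its fix, not the poster's actual code; the function name, the `step % 10` condition, and the logging are all made up for the example.

```python
import torch
import torch.distributed as dist

def log_mean_loss(loss: torch.Tensor, step: int) -> None:
    # Deadlock-prone: only rank 0 would enter the collective, the other
    # ranks never call it, and rank 0 blocks forever inside all_reduce.
    #   if dist.get_rank() == 0 and step % 10 == 0:
    #       dist.all_reduce(loss)

    # Safe: every rank evaluates the same condition and reaches the
    # collective together.
    if step % 10 == 0:
        reduced = loss.detach().clone()
        dist.all_reduce(reduced)          # sums across all ranks
        reduced /= dist.get_world_size()  # turn the sum into a mean
        if dist.get_rank() == 0:
            print(f"step {step}: mean loss {reduced.item():.4f}")
```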