
Training gets stuck when using SyncBN

See original GitHub issue

DistributedDataParallel works great for me. But when I use it together with synchronized batch normalization (either the Python version or the optimized version), the training gets stuck after a few iterations and the code gives the following warning:

/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))

Any idea how I should debug it?
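
For reference, the kind of setup being described looks roughly like the sketch below. It uses PyTorch's built-in SyncBatchNorm converter purely for illustration (the issue itself concerns the apex implementations), and it assumes the script is launched with one process per GPU, e.g. via torch.distributed.launch or torchrun:

# Minimal sketch of a DDP + synchronized BatchNorm setup, one process per GPU.
# Illustrative only: the issue above is about apex's SyncBatchNorm, not the
# native converter used here, and the tiny model is a placeholder.
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def main(local_rank: int):
    dist.init_process_group(backend="nccl")  # default env:// init
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(),
    ).cuda(local_rank)

    # Replace BatchNorm layers with synchronized BatchNorm, so statistics are
    # reduced across all processes on every forward pass.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop: each iteration now issues collective ops for the BN
    # statistics in addition to DDP's gradient all-reduce.

if __name__ == "__main__":
    # Assumes the launcher sets LOCAL_RANK in the environment.
    main(int(os.environ.get("LOCAL_RANK", 0)))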

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 19 (2 by maintainers)

Top GitHub Comments

3 reactions
heilaw commented, Dec 20, 2018

I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. However, I don't think that's a good idea: get_default_group is not available in PyTorch 1.0, so this creates a new group every time the sync BN forward function is called! It looks like we are using the default group anyway, so we may not need that line.

After I removed that line and dropped the process_group argument from both the torch.distributed.all_reduce and torch.distributed.all_gather calls, training works, even with tqdm.
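
To make the failure mode concrete, here is a rough sketch of the pattern being described (placeholder names, not the actual apex source): torch.distributed.new_group() is itself a collective that every rank must reach, so creating a group inside the forward pass on every call both leaks groups and makes a hang easy to trigger, whereas the default group set up by init_process_group can simply be reused.

# Hedged illustration of the pattern discussed above; sync_stats_* are
# placeholder names, not functions from apex.
import torch.distributed as dist

# Problematic: a fresh group per forward call. new_group() must be executed
# by all ranks, and PyTorch 1.0 has no get_default_group to fall back on.
def sync_stats_buggy(mean, var):
    group = dist.new_group()          # created again on every call
    dist.all_reduce(mean, group=group)
    dist.all_reduce(var, group=group)
    return mean, var

# Fix described in the comment: drop the per-call group and rely on the
# default group (group=None), which init_process_group already created.
def sync_stats_fixed(mean, var):
    dist.all_reduce(mean)             # uses the default group
    dist.all_reduce(var)
    return mean, var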

0 reactions
Lamply commented, Mar 3, 2021

I hit the same issue. In my case, I was also calling torch.distributed.all_reduce() inside my training loop under a condition like:

if args.local_rank == 0:
    torch.distributed.all_reduce(loss)

After fixing a couple of mistakes like this, the problem disappeared.
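
Collectives such as torch.distributed.all_reduce() must be entered by every rank in the group; guarding the call with a rank check leaves the other processes blocked and the training appears to hang. A minimal corrected sketch of the pattern above (args.local_rank and loss as in the original snippet):

# Every process must participate in the collective; only side effects such
# as logging should be guarded by the rank check.
torch.distributed.all_reduce(loss)                 # called on all ranks
loss = loss / torch.distributed.get_world_size()   # average across processes
if args.local_rank == 0:
    print(f"averaged loss: {loss.item()}")         # rank-0-only logging is fine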

Read more comments on GitHub >

Top Results From Across the Web

Validation hangs up when using DDP and syncbatchnorm
I'm using DDP (one process per GPU) to train a 3D UNet. ... The validation is probably getting stuck since SyncBatchNorm on rank 0...

Clara Training Framework - NVIDIA Documentation Center
Clara provides a training framework to help accelerate deep learning training and inference for medical imaging use cases. It allows medical imaging researchers ...

Training stuck (multi GPU, transformer) - Support - OpenNMT
I use the following command/setup: ... What happens is that it halts (gets stuck) at 'starting training loop without validation…'.

PyTorch 1.7.0 Now Available | Exxact Blog
This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something ...

Train a model — MMSegmentation 0.29.1 documentation
experimental support (Convert SyncBN to BN): ... The process of training on the CPU is consistent with single GPU training if machine does...
