
Deadlock when using AMP with DistributedDataParallel

See original GitHub issue

I’ve run into an issue when using AMP with DistributedDataParallel that leads to a deadlock at backward(). I am initializing AMP in the proper order as specified in the docs/example (create models -> move models to GPU -> initialize optimizers -> initialize AMP -> wrap models in DDP), and I am using the with amp.scale_loss(loss_d, optimizer_d) as scaled_loss: syntax for loss scaling/backward. The model I’m working on is a GAN, so I’m initializing AMP with a list of models as well as a list of optimizers, and calling amp.scale_loss() once for D and again for G (each referencing its associated optimizer).
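To illustrate, here is a stripped-down sketch of the setup I just described (the toy modules, sizes, and local_rank handling are placeholders, not my actual code; it assumes a torch.distributed.launch-style launch):

    import torch
    import torch.nn as nn
    from apex import amp
    from torch.nn.parallel import DistributedDataParallel as DDP

    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    local_rank = 0  # placeholder; in practice parsed from --local_rank

    # 1. create models and move them to the GPU (toy stand-ins for the real G and D)
    netG = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784)).cuda()
    netD = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1)).cuda()

    # 2. create the optimizers after the models are on the GPU
    opt_g = torch.optim.Adam(netG.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(netD.parameters(), lr=1e-4)

    # 3. initialize AMP with a list of models and a list of optimizers
    [netG, netD], [opt_g, opt_d] = amp.initialize(
        [netG, netD], [opt_g, opt_d], opt_level="O1")

    # 4. wrap the models in PyTorch DDP last
    netG = DDP(netG, device_ids=[local_rank], output_device=local_rank)
    netD = DDP(netD, device_ids=[local_rank], output_device=local_rank)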

The code runs as expected without AMP, as well as with AMP on a single GPU (non-distributed). It is only the combination of mixed precision and DDP that triggers this issue. As a note, I see the same behavior with every mixed-precision optimization level (opt_level).

Some strange behavior shows up when using DDP and AMP together, beyond the deadlock itself. My train loop updates D first, so a forward pass through G is performed and then the D loss is defined as loss_real + loss_fake. The forward pass through D takes a very long time (upwards of a minute on a V100), I get a gradient overflow warning if dynamic scaling is enabled (it says it is reducing the loss scale), and then it hangs on backward(). Again, when running the same code without DDP everything works well (the training speed-up and memory reduction are awesome).
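A rough sketch of that update order, continuing the names from the sketch above (real_batch, the noise size, and the BCE criterion are placeholders):

    criterion = nn.BCEWithLogitsLoss()
    real = real_batch.cuda(non_blocking=True)   # placeholder batch from the data loader
    noise = torch.randn(real.size(0), 64, device=real.device)
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(real.size(0), 1, device=real.device)

    # ---- D step (runs first) ----
    opt_d.zero_grad()
    fake = netG(noise).detach()                 # forward pass through G
    loss_d = criterion(netD(real), ones) + criterion(netD(fake), zeros)
    with amp.scale_loss(loss_d, opt_d) as scaled_loss:
        scaled_loss.backward()                  # <- this is where it hangs under DDP + AMP
    opt_d.step()

    # ---- G step ----
    opt_g.zero_grad()
    loss_g = criterion(netD(netG(noise)), ones)
    with amp.scale_loss(loss_g, opt_g) as scaled_loss:
        scaled_loss.backward()
    opt_g.step()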

Another thing that may be relevant is that I’m using a gradient penalty in my loss, so a call to torch.autograd.grad() is used to compute the GP. This doesn’t cause any problems when DDP or AMP is used on its own.
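For reference, a typical gradient penalty of this kind (WGAN-GP style) looks roughly like the following; the interpolation scheme and weight shown here are illustrative placeholders, not my exact code:

    def gradient_penalty(netD, real, fake, lambda_gp=10.0):
        # Random interpolation between real and fake samples.
        alpha = torch.rand(real.size(0), 1, device=real.device)
        interp = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
        d_interp = netD(interp)
        # create_graph=True so the penalty itself can be backpropagated later
        # (this is the double-backward that torch.autograd.grad enables here).
        grads = torch.autograd.grad(
            outputs=d_interp, inputs=interp,
            grad_outputs=torch.ones_like(d_interp),
            create_graph=True, retain_graph=True)[0]
        grads = grads.view(grads.size(0), -1)
        return lambda_gp * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()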

At the moment I cannot provide code samples for reproduction, so I understand this issue doesn’t give you much to go on, but any thoughts or suggestions would be very helpful.

Edit: It is also worth mentioning that I am using PyTorch’s DDP, not Apex DDP. I have not tested using Apex DDP yet. Will try that shortly

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
mcarilli commented, Mar 15, 2019

Great! My suspicion is that Amp’s casts were sending tensors to the default device for each process (which, without set_device, would have been device 0 for all processes). I think I will continue to recommend set_device to other people, since calling set_device is also PyTorch’s official guidance for multiprocess training (https://pytorch.org/docs/stable/distributed.html#launch-utility). Also, if Torch DDP is working, I think you should stick with that rather than trying Apex DDP. I don’t think Apex DDP handles double-backward (e.g. gradient penalty) properly.
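For reference, the per-process setup I mean looks roughly like this (assuming the torch.distributed.launch flow with the usual --local_rank argument; adjust to your launcher):

    import argparse
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # supplied by torch.distributed.launch
    args = parser.parse_args()

    # Pin this process to its GPU *before* building models, creating optimizers,
    # or calling amp.initialize, so Amp's casts and allocations land on the
    # correct device instead of device 0.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")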

I also wanted to reiterate the following point, since it is unrelated to the deadlock but may still be relevant:

Also, if you’re using a GAN, it’s possible you have more than one backward pass accumulating gradients for a given optimizer. If so, all but the last backward pass for that optimizer should have delay_unscale=True: https://nvidia.github.io/apex/advanced.html#multiple-backward-passes-per-iteration.

If you have multiple backward passes, but there is only one backward pass per optimizer, and each backward pass creates gradients for that particular optimizer, you don’t need to supply delay_unscale=True.
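In other words, the pattern from the linked doc looks roughly like this (loss1, loss2, and the single shared optimizer are placeholders):

    # Two backward passes accumulate gradients for the SAME optimizer within
    # one iteration; only the last scale_loss call for that optimizer unscales.
    optimizer.zero_grad()
    with amp.scale_loss(loss1, optimizer, delay_unscale=True) as scaled_loss:
        scaled_loss.backward()
    with amp.scale_loss(loss2, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()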

0 reactions
mdlockyer commented, Mar 15, 2019

Okay. Calling set_device() seems to have fixed this problem!

I appreciate your fast reply and your help solving this.

Read more comments on GitHub

Top Results From Across the Web

  • DistributedDataParallel deadlock - PyTorch Forums
    I’m trying to use DistributedDataParallel to train a model in a cluster with nodes that each has 2 GPUs K80 and in total...
  • torch.nn.parallel.distributed — PyTorch master documentation
    [docs]class DistributedDataParallel(Module): r"""Implements distributed data ... Here is how to use it: on each host with N GPUs, you should spawn up N ...
  • Changelog — PyTorch Lightning 1.8.5.post0 documentation
    Fixed the gradient unscaling logic when using Trainer(precision=16) and fused ... Added native AMP support for ddp_fork (and associated alias strategies) ...
  • PyTorch 1.7.0 Now Available | Exxact Blog
    PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to ...
  • DistributedDataParallel - PyTorch - W3cubDocs
    To use DistributedDataParallel on a host with N GPUs, you should spawn up N ... AMP maintains stashed gradients that are used for...
