DistributedDataParallel only works with volatile gradients.
See original GitHub issue
Hi, I tried to train a model on multiple GPUs. The code works fine on a single GPU, but when I try to use multiple GPUs I get the following error:
Traceback (most recent call last):
File "train.py", line 302, in <module>
loss.backward()
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 215, in distributed_data_parallel_hook
raise RuntimeError("DistributedDataParallel only works with volatile gradients")
RuntimeError: DistributedDataParallel only works with volatile gradients
terminate called after throwing an instance of 'gloo::EnforceNotMet'
It seems that the model parameters became non-volatile after being passed through the CTCLoss function. This is puzzling because the code works fine on a single GPU, so I suspect something is wrong with my DistributedDataParallel settings. I am using Python 3.6 with PyTorch 0.3.0, CUDA 8.0, and cuDNN 7.0. Could you tell me how you ran this multi-GPU training?
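For reference, here is a minimal sketch of how a model is typically wrapped with DistributedDataParallel in the 0.3-era API; the linear model, loss, optimizer, data loader, and process-group arguments below are placeholders, not details taken from this issue:

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.autograd import Variable
from torch.nn.parallel import DistributedDataParallel

# One worker process; init_method, world_size, and rank are placeholder values.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=2, rank=0)

model = nn.Linear(128, 10).cuda()        # stand-in for the real model
model = DistributedDataParallel(model)   # gradients are all-reduced across workers
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:           # `loader` stands for the user's DataLoader
    inputs = Variable(inputs.cuda())
    targets = Variable(targets.cuda())
    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()                      # the call that raised the RuntimeError above
    optimizer.step()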
Issue Analytics
- Created 5 years ago
- Comments: 6 (1 by maintainers)
Top Results From Across the Web
- DistributedDataParallel behaves weirdly - PyTorch Forums
  Hi, to speed up my training I was looking into pytorches DistributedDataParallel, since the docs state that DataParallel has a lot of ...
- torch.nn.parallel.distributed — PyTorch master documentation
  [docs]class DistributedDataParallel(Module): r"""Implements distributed ... it will only work if gradients are to be accumulated in ``.grad`` attributes of ...
- How distributed training works in Pytorch - AI Summer
  Parameters are never broadcasted between processes, only gradients. The module is replicated on each machine and each device. During the forward ...
- Pytorch - The Theory behind Distributed Data Parallel
  Another fact to consider is also that, if only training a subgraph of the neural network, not all gradients of the neural network ...
- Gradient Synchronization - Hugging Face
  This happens when the model is wrapped with DistributedDataParallel : ... (len(dataloader) - 1): with ddp_model.no_sync(): # Gradients only accumulate ... (a sketch of this pattern follows the list)
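As a side note on the Hugging Face result above, this is a rough sketch of the no_sync() gradient-accumulation pattern in the modern DDP API (not the 0.3.x API used in this issue); the DDP-wrapped model, data loader, optimizer, and criterion are assumed to be set up elsewhere:

def train_with_accumulation(ddp_model, dataloader, optimizer, criterion,
                            accumulation_steps=4):
    # Accumulate gradients locally and only all-reduce them every N steps.
    for step, (inputs, targets) in enumerate(dataloader):
        loss = criterion(ddp_model(inputs), targets) / accumulation_steps
        if (step + 1) % accumulation_steps != 0:
            with ddp_model.no_sync():   # backward skips the all-reduce here
                loss.backward()         # gradients only accumulate in local .grad
        else:
            loss.backward()             # this backward also triggers the all-reduce
            optimizer.step()
            optimizer.zero_grad()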
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Diamondfan Bravo, it works
The problem may be here. Try this:

ctx.grads = Variable(grads, volatile=True)

Because when using DDP in PyTorch 0.3.1, the gradients are required to be volatile.
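For illustration, a hedged sketch of where such a line would sit inside a 0.3-era custom autograd Function that pre-computes its gradients in forward; the class name and arguments are hypothetical and not the actual CTC loss binding:

from torch.autograd import Function, Variable

class CTCLossSketch(Function):
    # Hypothetical stand-in for a loss Function that computes its gradients
    # during forward and hands them back in backward.
    @staticmethod
    def forward(ctx, acts, labels, act_lens, label_lens):
        grads = acts.new(acts.size()).zero_()   # a real binding would fill these in
        costs = acts.new(1).zero_()             # placeholder for the computed loss
        # The suggested fix: cache the gradients as a volatile Variable so that
        # DistributedDataParallel's backward hook in 0.3.x accepts them.
        ctx.grads = Variable(grads, volatile=True)
        return costs

    @staticmethod
    def backward(ctx, grad_output):
        # Only the activations receive a gradient; the label and length
        # arguments get None.
        return ctx.grads, None, None, None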