
DistributedDataParallel only works with volatile gradients.

See original GitHub issue

Hi, I tried to train a model on multiple GPUs. The code works fine on a single GPU, but when I try to use multiple GPUs I get the following error:

Traceback (most recent call last):
  File "train.py", line 302, in <module>
    loss.backward()
  File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
  File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 215, in distributed_data_parallel_hook
    raise RuntimeError("DistributedDataParallel only works with volatile gradients")
RuntimeError: DistributedDataParallel only works with volatile gradients
terminate called after throwing an instance of 'gloo::EnforceNotMet'

It seems that the gradients of the model parameters become non-volatile after being computed by the CTCLoss function. This is puzzling, because the code works fine on a single GPU. Is there something wrong with my DistributedDataParallel settings? I am using Python 3.6 with PyTorch 0.3.0, CUDA 8.0, and cuDNN 7.0. Could you tell me how you ran this multi-GPU training?
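
For reference, here is a minimal sketch (not the poster's actual code) of a PyTorch 0.3-style multi-GPU setup that matches the traceback above. It assumes the gloo backend and the env:// rendezvous (MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE set in the environment); the model and loss are stand-ins, since the real code uses CTCLoss.

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.autograd import Variable

# One process per node; env:// reads the rendezvous info from the environment.
dist.init_process_group(backend='gloo', init_method='env://')

model = nn.Linear(10, 5).cuda()                      # stand-in for the real model
model = nn.parallel.DistributedDataParallel(model)   # the wrapper in question

x = Variable(torch.randn(4, 10).cuda())
loss = model(x).sum()                                # stand-in for CTCLoss
# With a custom loss whose saved gradients are non-volatile, this call is
# where the 0.3.x DDP gradient hook raises the RuntimeError above.
loss.backward()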

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
simonwang517 commented, Jul 25, 2018

@Diamondfan Bravo, it works

1 reaction
Diamondfan commented, Jul 25, 2018

The problem may be here.

Try this: ctx.grads = Variable(grads, volatile=True)

This is because DistributedDataParallel in PyTorch 0.3.1 requires the gradients to be volatile; see the check here.
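
For context, here is a minimal sketch of where that line goes, assuming a warp-ctc-style CTC binding (apart from ctx.grads, the class and variable names are illustrative, not the exact binding code). The Function precomputes the input gradients during forward and simply returns them in backward, so marking the saved tensor volatile is what satisfies DDP's check.

import torch
from torch.autograd import Function, Variable

class _CTC(Function):
    @staticmethod
    def forward(ctx, acts, labels, act_lens, label_lens):
        grads = torch.zeros(acts.size()).type_as(acts)  # written by the CTC kernel
        costs = torch.zeros(1)                          # placeholder for the loss
        # ... the actual CTC kernel call fills grads and costs here ...
        # The fix: save the precomputed gradients as a *volatile* Variable,
        # which is what DistributedDataParallel's hook in 0.3.x requires.
        ctx.grads = Variable(grads, volatile=True)
        return costs

    @staticmethod
    def backward(ctx, grad_output):
        # Hand back the saved (volatile) gradients for acts; the label and
        # length inputs take no gradient.
        return ctx.grads, None, None, None

Note that Variable and the volatile flag were removed in PyTorch 0.4, so this fix only applies to the 0.3.x releases discussed here.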

Read more comments on GitHub >

Top Results From Across the Web

DistributedDataParallel behaves weirdly - PyTorch Forums
Hi, to speed up my training I was looking into PyTorch's DistributedDataParallel, since the docs state that DataParallel has a lot of ...
Read more >
torch.nn.parallel.distributed — PyTorch master documentation
[docs]class DistributedDataParallel(Module): r"""Implements distributed ... it will only work if gradients are to be accumulated in ``.grad`` attributes of ...
Read more >
How distributed training works in Pytorch - AI Summer
Parameters are never broadcasted between processes, only gradients. The module is replicated on each machine and each device. During the forward ...
Read more >
Pytorch - The Theory behind Distributed Data Parallel
Another fact to consider is also that, if only training a subgraph of the neural network, not all gradients of the neural network...
Read more >
Gradient Synchronization - Hugging Face
This happens when the model is wrapped with DistributedDataParallel : ... (len(dataloader) - 1): with ddp_model.no_sync(): # Gradients only accumulate ...
Read more >
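
The excerpt above is cut off; for completeness, here is a sketch of the gradient-accumulation pattern it describes (ddp_model, dataloader, loss_func, and optimizer are assumed to exist; ddp_model is a DistributedDataParallel-wrapped module):

for index, (inputs, targets) in enumerate(dataloader):
    if index != (len(dataloader) - 1):
        # Not the last micro-batch: skip the gradient all-reduce, so
        # gradients only accumulate locally inside no_sync().
        with ddp_model.no_sync():
            loss = loss_func(ddp_model(inputs), targets)
            loss.backward()
    else:
        # Last micro-batch: this backward() triggers the synchronization.
        loss = loss_func(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()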
