DistributedDataParallel only works with volatile gradients.
See original GitHub issue
Hi, I tried to train a model on multiple GPUs. The code works fine on a single GPU, but when I try to use multiple GPUs I get the following error:
Traceback (most recent call last):
File "train.py", line 302, in <module>
loss.backward()
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
File "/export/gpudata/wangbo/conda/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 215, in distributed_data_parallel_hook
raise RuntimeError("DistributedDataParallel only works with volatile gradients")
RuntimeError: DistributedDataParallel only works with volatile gradients
terminate called after throwing an instance of 'gloo::EnforceNotMet'
It seems that the model parameters became non-volatile after being passed through the CTCLoss function. This is puzzling because the code works fine on a single GPU, so I suspect something is wrong with my DistributedDataParallel settings. I am using Python 3.6 with PyTorch 0.3.0, CUDA 8.0, and cuDNN 7.0. Could you tell me how you ran this multi-GPU training?
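For reference, here is a minimal sketch of how a model is typically wrapped with DistributedDataParallel in the 0.3-era API; the linear model, loss, optimizer, data loader, and process-group arguments below are placeholders, not details taken from this issue:

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.autograd import Variable
from torch.nn.parallel import DistributedDataParallel

# One worker process; init_method, world_size, and rank are placeholder values.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:23456',
                        world_size=2, rank=0)

model = nn.Linear(128, 10).cuda()        # stand-in for the real model
model = DistributedDataParallel(model)   # gradients are all-reduced across workers
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:           # `loader` stands for the user's DataLoader
    inputs = Variable(inputs.cuda())
    targets = Variable(targets.cuda())
    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()                      # the call that raised the RuntimeError above
    optimizer.step()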
Issue Analytics
- Created 5 years ago
- Comments: 6 (1 by maintainers)
Top Results From Across the Web
- DistributedDataParallel behaves weirdly - PyTorch Forums
  Hi, to speed up my training I was looking into pytorches DistributedDataParallel, since the docs state that DataParallel has a lot of ...
- torch.nn.parallel.distributed — PyTorch master documentation
  [docs]class DistributedDataParallel(Module): r"""Implements distributed ... it will only work if gradients are to be accumulated in ``.grad`` attributes of ...
- How distributed training works in Pytorch - AI Summer
  Parameters are never broadcasted between processes, only gradients. The module is replicated on each machine and each device. During the forward ...
- Pytorch - The Theory behind Distributed Data Parallel
  Another fact to consider is also that, if only training a subgraph of the neural network, not all gradients of the neural network ...
- Gradient Synchronization - Hugging Face
  This happens when the model is wrapped with DistributedDataParallel : ... (len(dataloader) - 1): with ddp_model.no_sync(): # Gradients only accumulate ... (a sketch of this pattern follows the list)
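As a side note on the Hugging Face result above, this is a rough sketch of the no_sync() gradient-accumulation pattern in the modern DDP API (not the 0.3.x API used in this issue); the DDP-wrapped model, data loader, optimizer, and criterion are assumed to be set up elsewhere:

def train_with_accumulation(ddp_model, dataloader, optimizer, criterion,
                            accumulation_steps=4):
    # Accumulate gradients locally and only all-reduce them every N steps.
    for step, (inputs, targets) in enumerate(dataloader):
        loss = criterion(ddp_model(inputs), targets) / accumulation_steps
        if (step + 1) % accumulation_steps != 0:
            with ddp_model.no_sync():   # backward skips the all-reduce here
                loss.backward()         # gradients only accumulate in local .grad
        else:
            loss.backward()             # this backward also triggers the all-reduce
            optimizer.step()
            optimizer.zero_grad()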
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Diamondfan Bravo, it works
The problem may be here. Try this:

ctx.grads = Variable(grads, volatile=True)

Because when using DDP in PyTorch 0.3.1, the gradients are required to be volatile.
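For illustration, a hedged sketch of where such a line would sit inside a 0.3-era custom autograd Function that pre-computes its gradients in forward; the class name and arguments are hypothetical and not the actual CTC loss binding:

from torch.autograd import Function, Variable

class CTCLossSketch(Function):
    # Hypothetical stand-in for a loss Function that computes its gradients
    # during forward and hands them back in backward.
    @staticmethod
    def forward(ctx, acts, labels, act_lens, label_lens):
        grads = acts.new(acts.size()).zero_()   # a real binding would fill these in
        costs = acts.new(1).zero_()             # placeholder for the computed loss
        # The suggested fix: cache the gradients as a volatile Variable so that
        # DistributedDataParallel's backward hook in 0.3.x accepts them.
        ctx.grads = Variable(grads, volatile=True)
        return costs

    @staticmethod
    def backward(ctx, grad_output):
        # Only the activations receive a gradient; the label and length
        # arguments get None.
        return ctx.grads, None, None, None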