KeyError when using DDP
Using DistributedDataParallel results in a KeyError with the following trace:
```
Traceback (most recent call last):
  File "training_script.py", line 328, in iteration
    loss.backward()
  File "/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 362, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 422, in comm_ready_buckets
    bucket_idx, bucket_loc = self.param_id_to_bucket[id(param)]
KeyError: 139779067778824
```
PyTorch: 1.0.0; Apex compiled without --cpp_ext.
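For context, here is a minimal sketch of the usual Apex DDP setup in which this hook fires. This is illustrative only: the reporter's training_script.py is not shown, and the model, optimizer, and data here are placeholders.

```python
# Illustrative Apex DDP setup (placeholder model; not the reporter's script).
# The KeyError above is raised from the per-parameter allreduce hook that
# Apex DDP runs during loss.backward().
import argparse
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as ApexDDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 10).cuda()      # placeholder model
model = ApexDDP(model)                      # wrapped once, at the top level
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, device="cuda")
loss = model(x).sum()
loss.backward()                             # Apex's allreduce hooks fire here
optimizer.step()
```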
Do you wrap both the top-level module and the submodule in DDP separately? If so, that might result in some params having more than one allreduce hook registered… and I'm not sure what will happen.
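For illustration only (hypothetical modules, not the reporter's code), "wrapping both separately" would look like this; the submodule's parameters end up with allreduce hooks from both wrappers:

```python
# Hypothetical double-wrapping pattern (assumes the process group is already
# initialized, as in the sketch above). The backbone's parameters get hooks
# from both DDP wrappers.
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP

backbone = nn.Linear(16, 16).cuda()
model = nn.Sequential(backbone, nn.Linear(16, 2).cuda())

backbone_ddp = ApexDDP(backbone)  # hooks registered on backbone's params here...
model = ApexDDP(model)            # ...and again here, on the same parameters
```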
Apex DDP should support partial freezing or unfreezing of your model between iterations. I THINK that if you wrap only the top-level module in DDP, and then manually freeze (unfreeze) the parts of the model you don't want to train by setting requires_grad=False (True) on those parts before beginning forward passes, Apex DDP will detect that the set of non-frozen (requires_grad=True) params has changed and recreate the bucket structure properly.
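A rough sketch of that suggestion; `ddp_model`, `backbone`, and `head` are placeholders for a single top-level Apex DDP wrapper and whatever submodules you want to freeze or unfreeze:

```python
# Illustrative freeze/unfreeze between training phases (placeholder names).
def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Flip requires_grad before starting the forward passes of the next phase:
set_trainable(ddp_model.module.backbone, False)  # freeze this part
set_trainable(ddp_model.module.head, True)       # keep training this part
# Apex DDP should detect that the set of requires_grad=True params changed
# and rebuild its bucket structure on the next iteration.
```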
Another thing you can try is the delay_allreduce=True option, which waits until the end of backward and collects whatever gradients the model happened to create.
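That option is passed when constructing the wrapper, e.g.:

```python
# delay_allreduce=True skips the overlap-with-backward bucketing and instead
# allreduces whatever gradients exist once backward() has finished.
from apex.parallel import DistributedDataParallel as ApexDDP

ddp_model = ApexDDP(model, delay_allreduce=True)  # `model` is your unwrapped module
```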
@mcarilli I've also been asked about this. From my rough understanding, modifying the grads that are None inside the callback here would work in the delayed case? That of course assumes the parameter you actually register_hook on won't have a None grad, otherwise nothing works (which might be a bug on its own, even without dynamically changing which grads are None): https://github.com/NVIDIA/apex/blob/master/apex/parallel/distributed.py#L404
Also, I know very little about queue_callback. From the tests it seems you can only call it inside a hook? If there is a way @chenhuiji can insert the grad modification into that queue manually, that would solve the problem.
Another possible solution is just creating .grad for each parameter upfront and only ever zeroing them, never setting them back to None again (I don't know all the places where that could happen, but at least don't do it manually as a speed trick). In that case no parameter should end up without a .grad, even with data-dependent control flow.