
KeyError when using DDP

See original GitHub issue

Using DistributedDataParallel results in a KeyError with the following trace:

Traceback (most recent call last):
  File "training_script.py", line 328, in iteration
    loss.backward()
  File "/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 362, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 422, in comm_ready_buckets
    bucket_idx, bucket_loc = self.param_id_to_bucket[id(param)]
KeyError: 139779067778824

PyTorch: 1.0.0; Apex compiled without --cpp_ext

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
mcarilli commented on Feb 10, 2019

Do you wrap both the top-level module and the submodule in DDP separately? If so, some params might end up with more than one allreduce hook registered… and I'm not sure what will happen.

Apex DDP should support partially freezing or unfreezing your model between iterations. I THINK that if you wrap only the top-level module in DDP, then manually freeze or unfreeze the parts of the model you don't want to train by setting requires_grad=False (or True) on them before beginning the forward pass, Apex DDP will detect that the set of non-frozen (requires_grad=True) params has changed and recreate the bucket structure properly.
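
A minimal sketch of that pattern, assuming a toy model with a backbone submodule that is frozen for the first part of training (the Net class, step counts, and launcher setup are illustrative, not from the issue):

import torch
import torch.distributed as dist
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP

# Assumes the process was started by a distributed launcher that set
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE for init_process_group.
dist.init_process_group(backend="nccl")

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = ApexDDP(Net().cuda())   # wrap ONLY the top-level module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(2000):
    # Toggle requires_grad BEFORE the forward pass; Apex DDP should then
    # notice that the set of trainable params changed and rebuild its buckets.
    freeze_backbone = step < 1000
    for p in model.module.backbone.parameters():
        p.requires_grad = not freeze_backbone

    x = torch.randn(8, 16, device="cuda")
    loss = model(x).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()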

Another thing you can try is the delay_allreduce=True option, which waits until the end of backward and collects whatever gradients the model happened to create.
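
The option is passed when the model is wrapped; a quick sketch using the same toy setup as above:

from apex.parallel import DistributedDataParallel as ApexDDP

# delay_allreduce=True skips the per-parameter overlap machinery and instead
# allreduces whatever gradients exist once the whole backward pass finishes.
model = ApexDDP(Net().cuda(), delay_allreduce=True)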

0 reactions
FDecaYed commented on Mar 2, 2019

@mcarilli I've also been asked about this. From my rough understanding, modifying the grads that are None inside the callback here would work in the delayed case? That of course assumes the parameter you actually call register_hook on doesn't have a None grad, otherwise nothing works (which might be a bug on its own, even without dynamically changing which grads are None): https://github.com/NVIDIA/apex/blob/master/apex/parallel/distributed.py#L404

Also, I know very little about queue_callback. It seems from the tests that you can only do that from within a hook? If there is a way for @chenhuiji to insert the grad modification into that queue manually, it would solve the problem.

Another possible solution is to create .grad for each parameter upfront and only ever zero them, never setting them back to None (I don't know all the places that could happen, but at least don't do it manually as a speed trick). In that case no .grad would be missing, even with data-dependent control flow.
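
A rough sketch of that workaround (my reading of the suggestion, not a confirmed fix): pre-create a zero .grad for every parameter so none of them is ever missing, then only zero gradients in place afterwards.

# Pre-create a zero .grad for every parameter so that none is ever None,
# even when a data-dependent branch skips some parameters in an iteration.
for p in model.parameters():
    if p.grad is None:
        p.grad = torch.zeros_like(p)

# From here on, only zero gradients in place (optimizer.zero_grad() in
# PyTorch 1.0 does exactly that); never reset .grad to None manually.
optimizer.zero_grad()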

Read more comments on GitHub >

Top Results From Across the Web

Missing key error when train with DistributedDataParallel
The error is related to DDP initialization, not model loading. You should initialize a distributed process group before creating a DDP module (see the sketch after these results).

Using DDP with fork fails - Training
Using the DDP example here (DDP-based Scaling of Gaudi on PyTorch — Gaudi Documentation 1.6.0 documentation), tried to replace spawn with fork ...

KeyError: 'LAD is already registered in models'
I am facing this problem when trying to run CoLad train on my server: Ubuntu 20.4, Python 3.7, CUDA 10.1. This error came...

Multi-machine, multi-GPU distributed training
Multi-GPU training with NCCL ... Machine Learning AI Algorithm Engineering official account: datayx. DistributedDataParallel (DDP) supports multi-machine, multi-GPU ...

Multi node PyTorch Distributed Training Guide For People ...
A few examples that showcase the boilerplate of PyTorch DDP training code. Have each example work with torch.distributed.launch, torchrun ...
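
As the first result above notes, the process group must be initialized before the DDP wrapper is constructed. A minimal sketch of that ordering with stock torch.nn.parallel.DistributedDataParallel (the Linear model is just a placeholder):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# A launcher (torch.distributed.launch / torchrun) is assumed to have set
# MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK.
dist.init_process_group(backend="nccl")   # must happen before DDP(...)

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 2).cuda()
model = DDP(model, device_ids=[local_rank])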
