KeyError when using DDP
Using DistributedDataParallel results in a KeyError with the following trace:
```
Traceback (most recent call last):
  File "training_script.py", line 328, in iteration
    loss.backward()
  File "/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 362, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/anaconda3/lib/python3.7/site-packages/apex-0.1-py3.7-linux-x86_64.egg/apex/parallel/distributed.py", line 422, in comm_ready_buckets
    bucket_idx, bucket_loc = self.param_id_to_bucket[id(param)]
KeyError: 139779067778824
```
PyTorch: 1.0.0; Apex compiled without --cpp_ext.
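For context, here is a minimal sketch of the usual Apex DDP setup in which this hook fires. This is illustrative only: the reporter's training_script.py is not shown, and the model, optimizer, and data here are placeholders.

```python
# Illustrative Apex DDP setup (placeholder model; not the reporter's script).
# The KeyError above is raised from the per-parameter allreduce hook that
# Apex DDP runs during loss.backward().
import argparse
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as ApexDDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(10, 10).cuda()      # placeholder model
model = ApexDDP(model)                      # wrapped once, at the top level
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10, device="cuda")
loss = model(x).sum()
loss.backward()                             # Apex's allreduce hooks fire here
optimizer.step()
```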
Do you wrap both the top-level module and the submodule in DDP separately? If so, that might result in some params having more than one allreduce hook registered… and I'm not sure what will happen.
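For illustration only (hypothetical modules, not the reporter's code), "wrapping both separately" would look like this; the submodule's parameters end up with allreduce hooks from both wrappers:

```python
# Hypothetical double-wrapping pattern (assumes the process group is already
# initialized, as in the sketch above). The backbone's parameters get hooks
# from both DDP wrappers.
import torch.nn as nn
from apex.parallel import DistributedDataParallel as ApexDDP

backbone = nn.Linear(16, 16).cuda()
model = nn.Sequential(backbone, nn.Linear(16, 2).cuda())

backbone_ddp = ApexDDP(backbone)  # hooks registered on backbone's params here...
model = ApexDDP(model)            # ...and again here, on the same parameters
```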
Apex DDP should support partial freezing or unfreezing of your model between iterations. I THINK that if you wrap only the top-level module in DDP, and then manually freeze (unfreeze) the parts of the model you don't want to train by setting requires_grad=False (True) on those parts before beginning forward passes, Apex DDP will detect that the set of non-frozen (requires_grad=True) params has changed and recreate the bucket structure properly.
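A rough sketch of that suggestion; `ddp_model`, `backbone`, and `head` are placeholders for a single top-level Apex DDP wrapper and whatever submodules you want to freeze or unfreeze:

```python
# Illustrative freeze/unfreeze between training phases (placeholder names).
def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Flip requires_grad before starting the forward passes of the next phase:
set_trainable(ddp_model.module.backbone, False)  # freeze this part
set_trainable(ddp_model.module.head, True)       # keep training this part
# Apex DDP should detect that the set of requires_grad=True params changed
# and rebuild its bucket structure on the next iteration.
```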
Another thing you can try is the delay_allreduce=True option, which waits until the end of backward and collects whatever gradients the model happened to create.
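That option is passed when constructing the wrapper, e.g.:

```python
# delay_allreduce=True skips the overlap-with-backward bucketing and instead
# allreduces whatever gradients exist once backward() has finished.
from apex.parallel import DistributedDataParallel as ApexDDP

ddp_model = ApexDDP(model, delay_allreduce=True)  # `model` is your unwrapped module
```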
@mcarilli I've also been asked about this. From my rough understanding, modifying the grads that are None inside the callback here would work in the delayed case? That of course assumes the parameter you actually register_hook on won't have a None grad, otherwise nothing works (which might be a bug on its own, even without dynamically changing which grads are None): https://github.com/NVIDIA/apex/blob/master/apex/parallel/distributed.py#L404
Also, I know very little about queue_callback. From the tests it seems you can only call it inside a hook? If there is a way @chenhuiji can insert the grad modification into that queue manually, that would solve the problem.
Another possible solution is just creating .grad for each parameter upfront and only ever zeroing them, never setting them back to None again (I don't know all the places where that could happen, but at least don't do it manually as a speed trick). In that case no parameter should end up without a .grad, even with data-dependent control flow.