Parallelize all-reduce communication and backprop
Thank you for open sourcing such a great repo for the community! Your work is really helping our team with training large pretrained models 😃
In our experiments, we found that when training a not-that-large model (e.g. 2.7B) with data parallelism, the scaling efficiency across multiple nodes is not good enough (under 70% for 2 nodes in our case). One reason is that currently the backward computation (the “BackwardPass” instruction) and the gradient communication (introduced in the “ReduceGrads” instruction) are executed sequentially. If we instead start the all-reduce for each gradient right after it is computed, we can overlap the backward computation with ReduceGrads and reduce the impact of cross-node communication.
We could use the backward hook mechanism in PyTorch for this optimization; there is an example of this pattern in the PyTorch source code.
This optimization may only work for pure data parallelism, as the communication pattern is quite different under model parallelism or pipeline parallelism.
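For concreteness, below is a minimal sketch of the idea, assuming a plain PyTorch data-parallel setup with `torch.distributed` already initialized. The function names (`overlap_grad_allreduce`, `wait_all`) are our own illustrative names, not gpt-neox or DeepSpeed APIs:

```python
# Minimal sketch of overlapping gradient all-reduce with backprop, assuming
# pure data parallelism and an initialized torch.distributed process group.
# `overlap_grad_allreduce` / `wait_all` are illustrative names, not real APIs.
import torch
import torch.distributed as dist


def overlap_grad_allreduce(model: torch.nn.Module):
    world_size = dist.get_world_size()
    pending = []  # async work handles for in-flight all-reduces

    def hook(grad):
        # Pre-divide so the SUM across ranks yields the average gradient.
        grad = grad / world_size
        # Launch the all-reduce immediately; it runs while autograd keeps
        # computing gradients for the remaining (earlier) layers.
        work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append(work)
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def wait_all():
        # Call after loss.backward() and before optimizer.step().
        for work in pending:
            work.wait()
        pending.clear()

    return wait_all
```

A training step would then look roughly like `wait_for_grads = overlap_grad_allreduce(model)` (once, after building the model), then `loss.backward()`, `wait_for_grads()`, `optimizer.step()`. Production implementations such as PyTorch DDP additionally bucket small gradients so each all-reduce is large enough to hide launch latency, but the sketch shows the core overlap.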
We’d love to help if you are interested in applying such an optimization to your project (gpt-neox or DeeperSpeed)~ Thank you again for your great contribution to the community!
P.S. We found some behavior that differs from the comment here: https://github.com/EleutherAI/gpt-neox/blob/f6c611f3211521fa7b145950ea100f44a2d0ead6/megatron/neox_arguments/arguments.py#L755-L758
- In our experiment, the `PipelineModule` wrapper is used when `pipe_parallel_size` is set to 1, and the `to_sequential()` version is used only when `pipe_parallel_size` is set to 0;
- The `PipelineModule` is observably faster than the `to_sequential()` version.
I wonder if this is the expected behavior? Thank you.
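For reference, the branching we observed corresponds roughly to the following sketch (hypothetical code restating the observation, not the actual gpt-neox wrapping logic; `wrap_model` and `pipe_model` are illustrative names):

```python
# Hypothetical restatement of the observed behavior, NOT gpt-neox source code.
def wrap_model(pipe_model, pipe_parallel_size: int):
    if pipe_parallel_size == 0:
        # Only pp=0 falls back to the plain sequential version.
        return pipe_model.to_sequential()
    # pp >= 1 (including pp=1) keeps the PipelineModule wrapper.
    return pipe_model
```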
Top GitHub Comments
Hey @zhuzilin, really interesting!
Firstly, wrt the speed difference between pp=0 and pp=1, we also found a similar thing, see https://github.com/EleutherAI/gpt-neox/pull/269 . Although maybe the speed difference isn’t quite as stark as what you found. I’m not sure of the source of the difference.
Wrt the optimization, I see no reason this couldn’t also work with MP and PP, and we’d be very interested in getting something like this implemented. I suspect it might not be so straightforward with DeepSpeed though! Fundamentally, you’re doing the same communication op with MP/PP, just the group you’re reducing within is smaller. So I think this should definitely be possible, but I’m not yet certain how this optimization would interact with:
@ShivanshuPurohit is going to look into this 😃
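To illustrate the point about the smaller reduction group: under MP/PP the per-gradient all-reduce would simply run over the data-parallel process group instead of the whole world. A rough sketch, with an illustrative rank layout that is not DeepSpeed's or Megatron's actual topology code:

```python
# Rough sketch of building the data-parallel group gradients are reduced in,
# assuming torch.distributed is initialized. The rank layout is illustrative.
import torch.distributed as dist


def build_data_parallel_group(model_parallel_size: int):
    """Return the process group this rank should all-reduce gradients within.
    With pure DP it is the whole world; with MP/PP it is only the ranks that
    hold the same model shard."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    dp_group = None
    # Ranks sharing the same model-parallel coordinate form one DP group.
    for mp_rank in range(model_parallel_size):
        ranks = list(range(mp_rank, world_size, model_parallel_size))
        group = dist.new_group(ranks)  # must be called on every rank
        if rank in ranks:
            dp_group = group
    return dp_group
```

The hook from the sketch in the issue body would then pass `group=dp_group` to `dist.all_reduce`; the overlap itself is unchanged, only the set of participating ranks shrinks.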
@reyoung can you post whatever performance statistics you have with your 2-node cluster setup? FLOPS, % comms, etc.?