
Parallel all reduce communication and backprop


Thank you for open-sourcing such a great repo for the community! Your work is really helping our team with training large pretrained models 😃

In our experiments, we found that when training a not-that-large model (e.g. 2.7B) with data parallelism, the scaling efficiency across multiple nodes is not good enough (under 70% for 2 nodes in our case). One reason is that currently the backward calculation (the “BackwardPass” instruction) and the communication (introduced in the “ReduceGrads” instruction) are executed sequentially. If instead we start the allreduce for each gradient right after it is calculated, we can overlap the backward computation with ReduceGrads and reduce the impact of cross-node communication.


We could use the backward hook mechanism in PyTorch for this optimization. Here is an example in the PyTorch source code.
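
Below is a minimal sketch of how the hook-based overlap could look, assuming plain torch.distributed data parallelism. This is only an illustration of the idea, not the gpt-neox/DeeperSpeed implementation; register_overlapped_allreduce and finalize are hypothetical names:

```python
import torch.distributed as dist

def register_overlapped_allreduce(model, world_size, dp_group=None):
    """Start an async allreduce for each gradient as soon as backward produces it."""
    handles = []

    def hook(grad):
        grad = grad / world_size  # pre-scale so a SUM reduce yields the data-parallel average
        # async_op=True returns immediately, so this allreduce runs while the
        # backward pass keeps computing gradients for the remaining layers.
        handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM,
                                       group=dp_group, async_op=True))
        return grad

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

    def finalize():
        # Call this after loss.backward() and before optimizer.step(),
        # so every in-flight reduce has finished.
        for h in handles:
            h.wait()
        handles.clear()

    return finalize
```

The caller would invoke the returned finalize() between loss.backward() and optimizer.step(). Note that this simple per-tensor version does not handle gradient accumulation across micro-batches or ZeRO partitioning, which is part of why frameworks such as PyTorch DDP implement a bucketed version of the same idea internally.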

This optimization may only work for pure data parallelism, as the communication pattern is quite different with model parallelism or pipeline parallelism.

We’d love to help if you are interested in applying such an optimization to your project (gpt-neox or DeeperSpeed). Thank you again for your great contribution to the community!

P.S. We observed some behavior that differs from the comment here: https://github.com/EleutherAI/gpt-neox/blob/f6c611f3211521fa7b145950ea100f44a2d0ead6/megatron/neox_arguments/arguments.py#L755-L758

  • In our experiments, the PipelineModule wrapper is used when pipe_parallel_size is set to 1, and the to_sequential() version is used only when pipe_parallel_size is set to 0;
  • The PipelineModule is observably faster than the to_sequential() version.

I wonder whether these are the expected behaviors? Thank you.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
sdtblck commented, Mar 1, 2022

Hey @zhuzilin, really interesting!

Firstly, w.r.t. the speed difference between pp=0 and pp=1, we also found something similar; see https://github.com/EleutherAI/gpt-neox/pull/269, although maybe the speed difference isn’t quite as stark as what you found. I’m not sure of the source of the difference.

W.r.t. the optimization, I see no reason this couldn’t also work with MP and PP, and we’d be very interested in getting something like this implemented. I suspect it might not be so straightforward with DeepSpeed though! Fundamentally, you’re doing the same communication op with MP / PP, just the group you’re reducing within is smaller (rough sketch after the list below). So I think this should definitely be possible, but I’m not yet certain how this optimization would interact with:

  1. DeepSpeed. All training currently relies on the DeepSpeed engine, and it “handles” DP optimization for you. We would have to figure out how to fully handle this ourselves, or implement the optimization in DeepSpeed. (We’re trying to remove our dependency on DeepSpeed and move to OSLO, but this will likely take a while.)
  2. ZeRO 1 / 2. This also ties in with the above, since these optimizers are implemented in DeepSpeed. But making this optimization compatible with the ZeRO 1 / 2 optimizers would likely require some more work.
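
To make the “same op, smaller group” point concrete, here is a rough sketch of how the data-parallel group could be built when model parallelism is present. The rank layout is illustrative (my own assumption, not gpt-neox’s actual topology code), and build_dp_group is a made-up helper:

```python
import torch.distributed as dist

def build_dp_group(world_size, model_parallel_size):
    """Return the process group of ranks that hold identical weights (DP replicas)."""
    rank = dist.get_rank()
    dp_group = None
    # With model parallelism, ranks i, i + mp, i + 2*mp, ... are data-parallel
    # replicas of the same shard and should reduce gradients among themselves.
    for i in range(model_parallel_size):
        ranks = list(range(i, world_size, model_parallel_size))
        group = dist.new_group(ranks)  # every rank must create every group, in the same order
        if rank in ranks:
            dp_group = group
    return dp_group

# The backward hook would then reduce within this smaller group:
#   dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=dp_group, async_op=True)
```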

0 reactions
StellaAthena commented, Feb 28, 2022

@ShivanshuPurohit is going to look into this 😃

@reyoung can you post whatever performance statistics you have for your 2-node cluster setup? FLOPS, % comms, etc.?
