[Tensor Parallelism] Megatron-LM to transformers
🚀 Feature request
Splitting off the discussion that started here: https://github.com/huggingface/transformers/pull/10301#issuecomment-782917393 to track the potential future feature of Tensor Parallelism (Horizontal Model Parallelism) in transformers - for the bigger context please see the Parallelism notes.
Let’s start with an important clarification: MP can mean many different things.
- Vertical MP - slice the layers vertically: one or more full layers are placed on each GPU. In this case Vertical MP is a simple version of PP with chunks=1.
- Horizontal MP - slice the layers horizontally: a slice of every layer of the model is placed on each GPU. Example: Megatron-LM. (A minimal sketch contrasting the two follows below.)
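To make the contrast concrete, here is a toy sketch - not how `transformers` implements either mode; it just assumes two visible GPUs and made-up layer sizes:

```python
import torch
import torch.nn as nn

# --- Vertical MP: whole layers live on different GPUs, data flows sequentially ---
layer1 = nn.Linear(1024, 1024).to("cuda:0")
layer2 = nn.Linear(1024, 1024).to("cuda:1")

x = torch.randn(8, 1024, device="cuda:0")
y = layer2(layer1(x).to("cuda:1"))  # cuda:1 has to wait for cuda:0 to finish

# --- Horizontal MP: every GPU holds a slice of the *same* layer, working in parallel ---
full = nn.Linear(1024, 1024, bias=False)
w0, w1 = full.weight.chunk(2, dim=0)              # split the output dimension in two
shard0, shard1 = w0.to("cuda:0"), w1.to("cuda:1")

y0 = x @ shard0.t()                               # both products can run concurrently
y1 = x.to("cuda:1") @ shard1.t()
y_full = torch.cat([y0, y1.to("cuda:0")], dim=-1) # same result as the unsplit layer on one device
```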
At the moment I think it’s only Megatron-LM that implements Horizontal MP. @anton-l has ported that model to `transformers`, except for the Horizontal MP parts, since `transformers` doesn’t yet have support for it. There is already naive Vertical MP in t5 and gpt2 thanks to @alexorona’s work, I ported Bart too but it’s unmerged, and there is an ongoing effort to figure out how to implement the Pipeline. All of these will have to co-operate with each other and also share common tools.
@anton-l started outlining what needs to be done to make this important feature available - and then, down the road, potentially make it available to other (all?) `transformers` models.
@anton-l, the floor is yours.
Top GitHub Comments
@stas00 thanks for starting this thread!
I guess, in order for everyone to be on the same page, a brief explanation of horizontal parallelism is needed. This thread would also be a good place for future reference and for introducing other contributors to the core concepts.
NOTE for everyone reading: If you find any of the explanations below confusing, you can read about Megatron-LM in much more detail in its original paper: https://arxiv.org/pdf/1909.08053.pdf
The core idea
The main thing that separates Megatron-style (horizontal) parallelism from vertical parallelism is the way that it splits the model layers between GPUs without the need for idle time during training/inference (i.e. waiting while the previous GPUs complete their work on the previous layers of the model). This makes the whole process much more asynchronous, just like in MapReduce. Here’s my rough sketch of how it looks:
Now the question is: how do we split the computation of those layers so that the parallelized model stays equivalent to the original (single-device) one?
Parallelized layers
Let’s start with a simple building block of any transformer: a fully connected layer (`nn.Linear`) followed by a nonlinear activation (GeLU). Following the Megatron paper’s notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.

If we look at the computation in matrix form, it’s easy to see how the matrix multiplication can be split between multiple GPUs: basically, if we split the weight matrix `A` column-wise across `N` GPUs and perform the matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into GeLU independently.

Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. The authors provide a helpful illustration for that in the paper.
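Here is a tiny single-process sketch of that split - the shapes and the 2-way split are made up, and the “GPUs” are just plain tensors - showing that splitting `A` column-wise and applying GeLU per shard reproduces the unsplit result:

```python
import torch
import torch.nn.functional as F

X = torch.randn(4, 16)   # input:  [batch, hidden]
A = torch.randn(16, 32)  # weight: [hidden, intermediate]

# Reference: the unsplit computation Y = GeLU(XA)
Y_ref = F.gelu(X @ A)

# "2 GPUs": split A column-wise into A_1, A_2 and compute the shards independently
A_1, A_2 = A.chunk(2, dim=1)
Y_1 = F.gelu(X @ A_1)  # would live on GPU 0
Y_2 = F.gelu(X @ A_2)  # would live on GPU 1

# Gathering the shards reproduces the reference output exactly,
# because GeLU is applied element-wise
assert torch.allclose(torch.cat([Y_1, Y_2], dim=1), Y_ref)
```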
Quick note on self-attention
Parallelizing the multiheaded attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
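As a quick sanity check of that independence, here is a rough single-process sketch (shapes and the 2-way split over heads are made up; per-GPU shards are simulated with plain tensor slices), showing that computing disjoint subsets of heads separately reproduces the full multi-head output:

```python
import torch
import torch.nn.functional as F

batch, seq, n_heads, d_head = 2, 8, 4, 16
q = torch.randn(batch, n_heads, seq, d_head)
k = torch.randn(batch, n_heads, seq, d_head)
v = torch.randn(batch, n_heads, seq, d_head)

def attention(q, k, v):
    # standard scaled dot-product attention, applied per head
    scores = q @ k.transpose(-1, -2) / d_head ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Reference: all heads at once
out_ref = attention(q, k, v)

# "2 GPUs": each one handles half of the heads, fully in parallel
out_0 = attention(q[:, :2], k[:, :2], v[:, :2])
out_1 = attention(q[:, 2:], k[:, 2:], v[:, 2:])

assert torch.allclose(torch.cat([out_0, out_1], dim=1), out_ref)
```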
Practical implementation
If you want to just dive right in, here are the basic building blocks implemented in Megatron-LM:
All of these rely on basic `Scatter`, `Gather` and `Reduce` ops to split and aggregate the weight matrices. Thanks to PyTorch Distributed, we can use `torch.distributed.all_reduce` and `all_gather` for that, without having to worry about GPU synchronization. The scatter and gather layers just have to define appropriate conjugate forward and backward passes. In a single transformer layer, there are 4 communication operations in total for the forward and backward passes.
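The exact building blocks aren’t reproduced here, but the core pattern is a pair of conjugate autograd functions: one that is the identity in the forward pass and all-reduces the gradient in the backward pass, and its counterpart that all-reduces the activations in the forward pass and is the identity in the backward pass. A minimal sketch of that pattern (simplified: it assumes `torch.distributed` is already initialized and uses the default process group rather than a dedicated model-parallel group):

```python
import torch
import torch.distributed as dist

class _CopyToModelParallelRegion(torch.autograd.Function):
    """Identity in the forward pass, all-reduce of the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)  # sum the gradients coming from all model-parallel ranks
        return grad

class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)  # sum the partial outputs from all model-parallel ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```

Roughly speaking, a column-parallel linear layer wraps its input with the first op and a row-parallel linear layer wraps its output with the second; with one such pair in the attention block and one in the MLP block, that adds up to the 4 all-reduces per layer mentioned above (2 in the forward pass, 2 in the backward pass).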
Other things to consider
Parallelized embeddings and output logits
Since the weights of the input and output embeddings of BERT/GPT2 are tied, they require a coordinated modification. In the original implementation, the input embedding matrix is parallelized along the vocabulary dimension (column-wise), and the output embedding’s matrix multiplication is parallelized together with the cross-entropy loss to reduce the communication size (see the end of section 3 in the paper).
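A rough sketch of the vocabulary-dimension split for the input embedding (heavily simplified: a made-up constructor, the default process group, no tied output projection or fused loss, and in a real implementation the all-reduce would go through an autograd-aware op like the `_ReduceFromModelParallelRegion` sketch above):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class VocabParallelEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, rank, world_size):
        super().__init__()
        # each rank owns a contiguous slice of the vocabulary
        self.shard_size = vocab_size // world_size
        self.vocab_start = rank * self.shard_size
        self.vocab_end = self.vocab_start + self.shard_size
        self.weight = torch.nn.Parameter(
            torch.empty(self.shard_size, hidden_size).normal_(std=0.02)
        )

    def forward(self, input_ids):
        # tokens that belong to other ranks' shards are zeroed out locally...
        mask = (input_ids < self.vocab_start) | (input_ids >= self.vocab_end)
        local_ids = (input_ids - self.vocab_start).clamp(0, self.shard_size - 1)
        out = F.embedding(local_ids, self.weight)
        out = out.masked_fill(mask.unsqueeze(-1), 0.0)
        # ...and the all-reduce sums in the rows owned by the other ranks
        dist.all_reduce(out)
        return out
```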
Model parallelism-aware Dropout
Transformers have dropout layers outside the model-parallel regions (before residual connections) and within model-parallel regions (in the self-attention block). Because some dropout layers are in a model-parallel region while others are not, we need to treat random number generation carefully to ensure dropout works correctly. See appendix B.2 in the paper for reference. The necessary RNG state tracking is implemented in `random.py`.
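As a bare-bones illustration of the idea (not the actual Megatron-LM code, which tracks per-device CUDA RNG states; the helper names here are made up and the CPU RNG is used for brevity): dropout outside model-parallel regions should use an RNG state that is identical on all ranks, while dropout inside them should use a per-rank state, and the two states are swapped in and out around the parallel regions.

```python
import contextlib
import torch

_rng_states = {}

def seed_rng(global_seed, model_parallel_rank):
    """Same seed everywhere for 'shared' dropout, a per-rank offset for
    dropout inside model-parallel regions."""
    torch.manual_seed(global_seed + 1000 + model_parallel_rank)
    _rng_states["model-parallel"] = torch.get_rng_state()
    torch.manual_seed(global_seed)  # default state: identical on all ranks

@contextlib.contextmanager
def model_parallel_rng():
    """Temporarily switch to the per-rank RNG state, then switch back."""
    shared_state = torch.get_rng_state()
    torch.set_rng_state(_rng_states["model-parallel"])
    try:
        yield
    finally:
        _rng_states["model-parallel"] = torch.get_rng_state()
        torch.set_rng_state(shared_state)

# dropout inside a model-parallel region (e.g. the self-attention block):
#   with model_parallel_rng():
#       x = torch.nn.functional.dropout(x, p=0.1)
```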
Hybrid model and data parallelism
Combining horizontal parallelism with data parallelism requires grouping the GPUs in a specific way, as described in appendix B.1 of the paper.
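In short: consecutive ranks form the model-parallel groups (typically so that the model-parallel all-reduces stay within a single node), and ranks that hold the same model shard across those groups form the data-parallel groups. A rough sketch of how the groups could be built with `torch.distributed` (the function name is made up; it assumes the global process group is already initialized and that `world_size` is divisible by `model_parallel_size`):

```python
import torch.distributed as dist

def build_groups(model_parallel_size):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    data_parallel_size = world_size // model_parallel_size

    model_parallel_group = None
    data_parallel_group = None

    # Consecutive ranks form a model-parallel group, e.g. with world_size=8 and
    # model_parallel_size=2: [0,1], [2,3], [4,5], [6,7]
    for i in range(data_parallel_size):
        ranks = list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
        group = dist.new_group(ranks)  # every rank must take part in creating every group
        if rank in ranks:
            model_parallel_group = group

    # Ranks holding the same model shard form a data-parallel group,
    # e.g. [0,2,4,6] and [1,3,5,7] in the example above
    for i in range(model_parallel_size):
        ranks = list(range(i, world_size, model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            data_parallel_group = group

    return model_parallel_group, data_parallel_group
```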
Sounds good on all counts. GPT2 would be perfect, @anton-l!
I had the same thought about just splitting your merged model if needed.
Please let us know how we can support you in this endeavor.
Just so you’re aware, I mentioned in the other thread the DeepSpeed version of their Megatron-LM port - perhaps theirs is newer - I haven’t had a chance to study it yet: https://github.com/jeffra/DSE/tree/master/megatron-lm. You could diff the different versions against the baseline - I assume it has been changed, but perhaps it hasn’t. Have a look if you want; if not, that’s fine too. Any starting point is a good one.