[Tensor Parallelism] Megatron-LM to transformers
🚀 Feature request
Splitting off the discussion that started here: https://github.com/huggingface/transformers/pull/10301#issuecomment-782917393 to track the potential future feature of Tensor Parallelism (Horizontal Model Parallelism) in transformers - for the bigger context please see the Parallelism notes.
Let’s start with an important clarification: MP can mean many different things.
- Vertical MP - slice the layers vertically: one or more full layers are placed on each GPU. In this case Vertical MP is a simple version of PP with chunks=1.
- Horizontal MP - slice the layers horizontally: a slice of every layer of the model is placed on each GPU. Example: Megatron-LM. (A minimal sketch contrasting the two follows below.)
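To make the contrast concrete, here is a toy sketch - not how `transformers` implements either mode; it just assumes two visible GPUs and made-up layer sizes:

```python
import torch
import torch.nn as nn

# --- Vertical MP: whole layers live on different GPUs, data flows sequentially ---
layer1 = nn.Linear(1024, 1024).to("cuda:0")
layer2 = nn.Linear(1024, 1024).to("cuda:1")

x = torch.randn(8, 1024, device="cuda:0")
y = layer2(layer1(x).to("cuda:1"))  # cuda:1 has to wait for cuda:0 to finish

# --- Horizontal MP: every GPU holds a slice of the *same* layer, working in parallel ---
full = nn.Linear(1024, 1024, bias=False)
w0, w1 = full.weight.chunk(2, dim=0)              # split the output dimension in two
shard0, shard1 = w0.to("cuda:0"), w1.to("cuda:1")

y0 = x @ shard0.t()                               # both products can run concurrently
y1 = x.to("cuda:1") @ shard1.t()
y_full = torch.cat([y0, y1.to("cuda:0")], dim=-1) # same result as the unsplit layer on one device
```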
At the moment I think it’s only Megatron-LM that implements Horizontal MP. @anton-l has ported that model to `transformers`, except for the Horizontal MP parts, since `transformers` doesn’t yet have support for it. There is already naive Vertical MP in t5 and gpt2 thanks to @alexorona’s work, I ported Bart too but it’s unmerged, and there is an ongoing effort to figure out how to implement the Pipeline. All of these will have to co-operate with each other and also share common tools.
@anton-l started outlining what needs to be done to make this important feature available - and then, down the road, potentially make it available to other (all?) `transformers` models.
@anton-l, the floor is yours.
Top GitHub Comments
@stas00 thanks for starting this thread!
I guess, in order for everyone to be on the same page, a brief explanation of horizontal parallelism is needed. This thread would also be a good place for future reference and for introducing other contributors to the core concepts.
NOTE for everyone reading: If you find any of the explanations below confusing, you can read about Megatron-LM in much more detail in its original paper: https://arxiv.org/pdf/1909.08053.pdf
The core idea
The main thing that separates Megatron-style (horizontal) parallelism from vertical parallelism is the way that it splits the model layers between GPUs without the need for idle time during training/inference (i.e. waiting while the previous GPUs complete their work on the previous layers of the model). This makes the whole process much more asynchronous, just like in MapReduce. Here’s my rough sketch of how it looks:
Now the question is: how do we split the computation of those layers so that the parallelized model stays equivalent to the original (single-device) one?
Parallelized layers
Let’s start with a simple building block of any transformer: a fully connected layer (`nn.Linear`) followed by a nonlinear activation (GeLU). Following the Megatron paper’s notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.

If we look at the computation in matrix form, it’s easy to see how the matrix multiplication can be split between multiple GPUs: basically, if we split the weight matrix `A` column-wise across `N` GPUs and perform the matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into GeLU independently.

Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. The authors provide a helpful illustration for that in the paper.
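Here is a tiny single-process sketch of that split - the shapes and the 2-way split are made up, and the “GPUs” are just plain tensors - showing that splitting `A` column-wise and applying GeLU per shard reproduces the unsplit result:

```python
import torch
import torch.nn.functional as F

X = torch.randn(4, 16)   # input:  [batch, hidden]
A = torch.randn(16, 32)  # weight: [hidden, intermediate]

# Reference: the unsplit computation Y = GeLU(XA)
Y_ref = F.gelu(X @ A)

# "2 GPUs": split A column-wise into A_1, A_2 and compute the shards independently
A_1, A_2 = A.chunk(2, dim=1)
Y_1 = F.gelu(X @ A_1)  # would live on GPU 0
Y_2 = F.gelu(X @ A_2)  # would live on GPU 1

# Gathering the shards reproduces the reference output exactly,
# because GeLU is applied element-wise
assert torch.allclose(torch.cat([Y_1, Y_2], dim=1), Y_ref)
```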
Quick note on self-attention
Parallelizing the multiheaded attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!
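As a quick sanity check of that independence, here is a rough single-process sketch (shapes and the 2-way split over heads are made up; per-GPU shards are simulated with plain tensor slices), showing that computing disjoint subsets of heads separately reproduces the full multi-head output:

```python
import torch
import torch.nn.functional as F

batch, seq, n_heads, d_head = 2, 8, 4, 16
q = torch.randn(batch, n_heads, seq, d_head)
k = torch.randn(batch, n_heads, seq, d_head)
v = torch.randn(batch, n_heads, seq, d_head)

def attention(q, k, v):
    # standard scaled dot-product attention, applied per head
    scores = q @ k.transpose(-1, -2) / d_head ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Reference: all heads at once
out_ref = attention(q, k, v)

# "2 GPUs": each one handles half of the heads, fully in parallel
out_0 = attention(q[:, :2], k[:, :2], v[:, :2])
out_1 = attention(q[:, 2:], k[:, 2:], v[:, 2:])

assert torch.allclose(torch.cat([out_0, out_1], dim=1), out_ref)
```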
Practical implementation
If you want to just dive right in, here are the basic building blocks implemented in Megatron-LM:
All of these rely on basic `Scatter`, `Gather` and `Reduce` ops to split and aggregate the weight matrices. Thanks to PyTorch Distributed, we can use `torch.distributed.all_reduce` and `all_gather` for that, without having to worry about GPU synchronization. The scatter and gather layers just have to define appropriate conjugate forward and backward passes. In a single transformer layer, there are 4 communication operations in total for the forward and backward passes.
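The exact building blocks aren’t reproduced here, but the core pattern is a pair of conjugate autograd functions: one that is the identity in the forward pass and all-reduces the gradient in the backward pass, and its counterpart that all-reduces the activations in the forward pass and is the identity in the backward pass. A minimal sketch of that pattern (simplified: it assumes `torch.distributed` is already initialized and uses the default process group rather than a dedicated model-parallel group):

```python
import torch
import torch.distributed as dist

class _CopyToModelParallelRegion(torch.autograd.Function):
    """Identity in the forward pass, all-reduce of the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)  # sum the gradients coming from all model-parallel ranks
        return grad

class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)  # sum the partial outputs from all model-parallel ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```

Roughly speaking, a column-parallel linear layer wraps its input with the first op and a row-parallel linear layer wraps its output with the second; with one such pair in the attention block and one in the MLP block, that adds up to the 4 all-reduces per layer mentioned above (2 in the forward pass, 2 in the backward pass).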
Other things to consider
Parallelized embeddings and output logits
Since the weights of the input and output embeddings of BERT/GPT2 are tied, they require a coordinated modification. In the original implementation, the input embedding matrix is parallelized along the vocabulary dimension (column-wise), and the output embedding’s matrix multiplication is parallelized together with the cross-entropy loss to reduce the communication size (see the end of section 3 in the paper).
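A rough sketch of the vocabulary-dimension split for the input embedding (heavily simplified: a made-up constructor, the default process group, no tied output projection or fused loss, and in a real implementation the all-reduce would go through an autograd-aware op like the `_ReduceFromModelParallelRegion` sketch above):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

class VocabParallelEmbedding(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, rank, world_size):
        super().__init__()
        # each rank owns a contiguous slice of the vocabulary
        self.shard_size = vocab_size // world_size
        self.vocab_start = rank * self.shard_size
        self.vocab_end = self.vocab_start + self.shard_size
        self.weight = torch.nn.Parameter(
            torch.empty(self.shard_size, hidden_size).normal_(std=0.02)
        )

    def forward(self, input_ids):
        # tokens that belong to other ranks' shards are zeroed out locally...
        mask = (input_ids < self.vocab_start) | (input_ids >= self.vocab_end)
        local_ids = (input_ids - self.vocab_start).clamp(0, self.shard_size - 1)
        out = F.embedding(local_ids, self.weight)
        out = out.masked_fill(mask.unsqueeze(-1), 0.0)
        # ...and the all-reduce sums in the rows owned by the other ranks
        dist.all_reduce(out)
        return out
```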
Model parallelism-aware Dropout
Transformers have dropout layers outside the model-parallel regions (before residual connections) and within model-parallel regions (in the self-attention block). Because some dropout layers are in a model-parallel region while others are not, we need to treat random number generation carefully to ensure dropout works correctly. See appendix B.2 in the paper for reference. The necessary RNG state tracking is implemented in `random.py`.
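As a bare-bones illustration of the idea (not the actual Megatron-LM code, which tracks per-device CUDA RNG states; the helper names here are made up and the CPU RNG is used for brevity): dropout outside model-parallel regions should use an RNG state that is identical on all ranks, while dropout inside them should use a per-rank state, and the two states are swapped in and out around the parallel regions.

```python
import contextlib
import torch

_rng_states = {}

def seed_rng(global_seed, model_parallel_rank):
    """Same seed everywhere for 'shared' dropout, a per-rank offset for
    dropout inside model-parallel regions."""
    torch.manual_seed(global_seed + 1000 + model_parallel_rank)
    _rng_states["model-parallel"] = torch.get_rng_state()
    torch.manual_seed(global_seed)  # default state: identical on all ranks

@contextlib.contextmanager
def model_parallel_rng():
    """Temporarily switch to the per-rank RNG state, then switch back."""
    shared_state = torch.get_rng_state()
    torch.set_rng_state(_rng_states["model-parallel"])
    try:
        yield
    finally:
        _rng_states["model-parallel"] = torch.get_rng_state()
        torch.set_rng_state(shared_state)

# dropout inside a model-parallel region (e.g. the self-attention block):
#   with model_parallel_rng():
#       x = torch.nn.functional.dropout(x, p=0.1)
```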
Hybrid model and data parallelism
Combining horizontal parallelism with data parallelism requires grouping the GPUs in a specific way, as described in appendix B.1 of the paper.
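In short: consecutive ranks form the model-parallel groups (typically so that the model-parallel all-reduces stay within a single node), and ranks that hold the same model shard across those groups form the data-parallel groups. A rough sketch of how the groups could be built with `torch.distributed` (the function name is made up; it assumes the global process group is already initialized and that `world_size` is divisible by `model_parallel_size`):

```python
import torch.distributed as dist

def build_groups(model_parallel_size):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    data_parallel_size = world_size // model_parallel_size

    model_parallel_group = None
    data_parallel_group = None

    # Consecutive ranks form a model-parallel group, e.g. with world_size=8 and
    # model_parallel_size=2: [0,1], [2,3], [4,5], [6,7]
    for i in range(data_parallel_size):
        ranks = list(range(i * model_parallel_size, (i + 1) * model_parallel_size))
        group = dist.new_group(ranks)  # every rank must take part in creating every group
        if rank in ranks:
            model_parallel_group = group

    # Ranks holding the same model shard form a data-parallel group,
    # e.g. [0,2,4,6] and [1,3,5,7] in the example above
    for i in range(model_parallel_size):
        ranks = list(range(i, world_size, model_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            data_parallel_group = group

    return model_parallel_group, data_parallel_group
```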
Sounds good on all counts. GPT2 would be perfect, @anton-l!
I had the same thought about just splitting your merged model if needed.
Please let us know how we can support you in this endeavor.
Just so you’re aware, I mentioned in the other thread the DeepSpeed version of their Megatron-LM port - perhaps theirs is newer - I haven’t had a chance to study it yet: https://github.com/jeffra/DSE/tree/master/megatron-lm. You could diff the different versions against the baseline - I assume it has been changed, but perhaps it hasn’t. Have a look if you want; if not, that’s fine too. Any starting point is a good one.