[RFC] Add `modeling_xxx_fusion.py` to support kernel fusion
Introduction
I am an engineer currently working on 3D model parallelism for transformers. Once the tensor model parallelism work (https://github.com/huggingface/transformers/pull/13726) is done, I am going to introduce a kernel fusion feature to transformers.
For this, I want to create a new modeling file called modeling_xxx_fusion.py. This work is currently being discussed with @stas00 and @RezaYazdaniAminabadi (DeepSpeed team).
Kernel fusion API
from transformers import BertForMaskedLM

# create model
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# 1. fuse_modules
# `fuse_modules` is function-level fusion; it supports a wide variety of models.
# All arguments are `True` by default.
model.fuse_modules()

# fuse selected modules only
model.fuse_modules(
    word_embedding=True,
    scale_mask_softmax=True,
    layer_norm=True,
    bias_act=True,
    bias_dropout_residual=False,
    cross_entropy=True,
)

# 2. fuse_layers
# `fuse_layers` is block-level (attention & MLP) fusion; only a few models are supported.
# The `inference` argument defaults to `None`, which resolves to `not self.training`.
model.fuse_layers(inference=None)

# fuse layers for inference
model.fuse_layers(inference=True)

# fuse layers for training
model.fuse_layers(inference=False)
Implementation
The internal modules of each model will be re-implemented using the kernel fusion method, and the existing modules will be replaced with the fused modules. The following is an example for BertOutput(nn.Module).
# transformers/models/bert/modeling_bert.py
class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


# transformers/models/bert/modeling_bert_fusion.py
class FusedBertOutput(BertOutput):
    def forward(self, hidden_states, input_tensor):
        hidden_states = hidden_states @ self.dense.weight.t()
        hidden_states = FusedBiasDropoutResidual.apply(hidden_states, self.dense.bias, input_tensor)
        hidden_states = FusedLayerNorm.apply(hidden_states, self.LayerNorm.weight, self.LayerNorm.bias)
        return hidden_states
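FusedBiasDropoutResidual and FusedLayerNorm above are custom fused ops that are not part of this snippet. As a rough illustration of the interface FusedBertOutput assumes, here is a minimal torch.autograd.Function sketch of FusedBiasDropoutResidual; a real version would dispatch to a fused CUDA or TorchScript kernel rather than composing eager PyTorch ops, and would read the dropout probability from the model config instead of the default used here:

import torch

class FusedBiasDropoutResidual(torch.autograd.Function):
    # Sketch only: computes dropout(hidden_states + bias) + residual as one op.
    @staticmethod
    def forward(ctx, hidden_states, bias, residual, p=0.1, training=True):
        out = hidden_states + bias
        if training and p > 0.0:
            # scaled dropout mask, saved for the backward pass
            mask = torch.empty_like(out).bernoulli_(1.0 - p).div_(1.0 - p)
        else:
            mask = torch.ones_like(out)
        ctx.save_for_backward(mask)
        return out * mask + residual

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        grad_hidden = grad_output * mask
        # bias gradient is reduced over every dim except the last (hidden) dim
        grad_bias = grad_hidden.sum(dim=tuple(range(grad_hidden.dim() - 1)))
        return grad_hidden, grad_bias, grad_output, None, None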
When the user calls the fuse_modules() method, the kernel fusion engine finds BertOutput and replaces it with FusedBertOutput. Likewise, when the user calls the fuse_layers() method, the engine finds BertLayer and replaces it with FusedBertLayer. This is the method that parallelformers uses to parallelize transformers models flexibly, and deepspeed also supports kernel fusion in this way.
However, the current version of deepspeed fuses the entire transformer layer, so the supported models are very limited. For example, BigBird requires a random attention mechanism, and in that case the random attention must be implemented in a custom CUDA kernel. Because the number of models is so large, it is impossible to implement them all. So I propose a flexible way to fuse kernels on a per-function basis. This is a triage strategy: the areas that can be fused use the fused kernels, and the areas that cannot be fused fall back to torch's default modules.
# kernel_fusion_utils.py
class KernelFusionMixin(object):
    def fuse_modules(self, ...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...

    def fuse_layers(self, ...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...


# modeling_utils.py
class PreTrainedModel(..., KernelFusionMixin):
    _is_parallelizable = ...
    _is_able_to_fuse = False  # <--- Only models that can be fused have `True`.
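As a usage sketch under this draft API (the flag and method names follow the pseudocode above; whether BertForMaskedLM would actually opt in, and the getattr guard, are my own assumptions):

# a fusable model class would opt in by overriding the flag, e.g.:
#     class BertForMaskedLM(BertPreTrainedModel):
#         _is_able_to_fuse = True
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-cased")
if getattr(model, "_is_able_to_fuse", False):
    model.fuse_modules()               # function-level fusion
    model.fuse_layers(inference=True)  # block-level fusion for inference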
This is a draft. The API can change at any time. I look forward to feedback. I'm going to show you this soon with a framework I'm making. (Like parallelformers, we will pre-open the repositories on our side and merge them into transformers and deepspeed later.)
Review of Fused Kernels for transformer
If you find other fused kernels, please let me know here. I’ll test and record them. 😃
1. Module-level Kernels
Module-level kernels are fused kernels for independent operation sets like scale + mask + softmax or bias + dropout + residual. This is in contrast to layer-level kernels, which fuse the entire transformer layer. Note that all kernels must have both forward and backward implementations when they are used for training.

FusedScaleMaskSoftmax (from Megatron-LM)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_softmax.py
This kernel fuses scale + masking + softmax for transformer attention. We tested this kernel, and it performs better than the original HuggingFace Transformers attention method. Note that there are some constraints to turn this kernel on, but some of the constraints defined in Megatron-LM aren't correct, so we modified them. The left image shows the case where the constraints are satisfied, and the right one shows the case where they are not (in that case, the performance is the same as the non-fused method).
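For reference, the unfused computation this kernel replaces is roughly the following (a plain TorchScript sketch for illustration only, not the Megatron-LM kernel itself; the -10000.0 masking value mirrors the usual HuggingFace convention):

import torch

@torch.jit.script
def scale_mask_softmax(scores: torch.Tensor, mask: torch.Tensor, scale: float) -> torch.Tensor:
    # scores: [batch, heads, seq_q, seq_k]; mask: boolean, True marks positions to hide
    scores = scores * scale
    scores = scores.masked_fill(mask, -10000.0)
    return torch.softmax(scores, dim=-1)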
FusedLayerNorm (from Megatron-LM and NVIDIA Apex)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_layer_norm.py
This kernel fuses all the operations of layer normalization. However, when we tested it, it was slower than the original torch.nn.LayerNorm. Therefore, we have decided not to provide this kernel. See https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059.
FusedBiasActivation (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_bias_gelu.py
This kernel fuses bias addition + GeLU function. All activation functions are supported because the user can use any activation function, but the speedup occurs only with the GeLU function (the other activation functions work the same as before). We use GeLU Fast, which is faster than the original GeLU implementation because all the numerical constants are provided already computed.

FusedBiasDropout (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This kernel fuses bias addition + dropout.

FusedBiasDropoutResidual (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This kernel fuses bias addition + dropout + residual addition. The images above show the performance of FusedGPT2MLP, made by combining FusedBiasActivation and FusedBiasDropout. These results show that two fused kernels can lead to a significant performance improvement over the original GPT2MLP.
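To make the torch.jit.script approach concrete, here is a minimal sketch of the first two kernels in the style of Megatron-LM's scripted functions (the exact signatures in Megatron-LM differ; 0.7978845608 is sqrt(2/pi), the constant used by the tanh GeLU approximation):

import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # bias addition + "fast" tanh-approximated GeLU, fused into one scripted graph
    x = x + bias
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * x * (1.0 + 0.044715 * x * x)))

@torch.jit.script
def fused_bias_dropout(x: torch.Tensor, bias: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    # bias addition + dropout in one scripted graph
    return torch.nn.functional.dropout(x + bias, p=p, training=training)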
FusedSplitHeads & FusedMergeHeads (torch.jit.script)
We tried to JIT-compile the view + permute + contiguous operations performed to split or merge heads in the transformer attention layer, but there was no difference in speed. Probably because these are not elementwise operations, the performance improvement is negligible. Therefore, we have decided not to provide this kernel.

FusedCrossEntropy (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/cross_entropy_layer.py
This kernel fuses log_softmax + nll_loss. However, when we tested it, it was about 2 ~ 3 times slower than the original torch.nn.CrossEntropyLoss. Therefore, we have decided not to provide this kernel. See https://github.com/bytedance/lightseq/issues/204.

FusedEmbedding (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/transformer_embedding_layer.py
This kernel fuses positional embedding + word embedding. We were very interested in this kernel, but unfortunately the positional embedding only supports the sinusoidal method, which almost no model uses today. Therefore, we have decided not to support this kernel.

FusedNoRepeatNGramLogitsProcessor (from fastseq)
https://github.com/microsoft/fastseq/blob/main/fastseq/ops/ngram_repeat_block.py
This kernel performs no repeat ngram blocking on the GPU when generating text. In our tests, there is no significant impact when the generated text is short, but it shows a very large performance improvement when the text is long. So we modified GenerationMixin to include this kernel. You will be able to use it via model.generate(..., no_repeat_ngram_size=n, fused_no_repeat_ngram_blocking=True) later.

I will also review layer-level kernels during this week. 😉
@stas00 We’ll be cutting a branch that works with PyTorch 1.11.0, and to be honest, I don’t think it’d be that hard to cut a release for 1.10.1 now either.
So, I think the issues with user setup are not that difficult to resolve.