[RFC] Add `modeling_xxx_fusion.py` to support kernel fusion
Introduction
I am an engineer currently working on 3D model parallelism for transformers. Once the tensor model parallelism work (https://github.com/huggingface/transformers/pull/13726) is done, I am going to introduce a kernel fusion feature to transformers.
For this, I want to create a new modeling file called modeling_xxx_fusion.py. This work is currently being discussed with @stas00 and @RezaYazdaniAminabadi (DeepSpeed team).
Kernel fusion API
from transformers import BertForMaskedLM

# create model
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# 1. fuse_modules
# `fuse_modules` is function-level fusion; it supports a wide variety of models.
# All arguments are `True` by default.
model.fuse_modules()

# fuse selected modules only
model.fuse_modules(
    word_embedding=True,
    scale_mask_softmax=True,
    layer_norm=True,
    bias_act=True,
    bias_dropout_residual=False,
    cross_entropy=True,
)

# 2. fuse_layers
# `fuse_layers` is block-level (attention & MLP) fusion; only a few models are supported.
# The `inference` argument defaults to `None`, which resolves to `not self.training`.
model.fuse_layers(inference=None)

# fuse layers for inference
model.fuse_layers(inference=True)

# fuse layers for training
model.fuse_layers(inference=False)
Implementation
The internal modules of each model will be re-implemented using the kernel fusion method, and the existing modules will be replaced with the fused modules. The following is an example for BertOutput(nn.Module).
# transformers/models/bert/modeling_bert.py
class BertOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


# transformers/models/bert/modeling_bert_fusion.py
class FusedBertOutput(BertOutput):
    def forward(self, hidden_states, input_tensor):
        hidden_states = hidden_states @ self.dense.weight.t()
        hidden_states = FusedBiasDropoutResidual.apply(hidden_states, self.dense.bias, input_tensor)
        hidden_states = FusedLayerNorm.apply(hidden_states, self.LayerNorm.weight, self.LayerNorm.bias)
        return hidden_states
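FusedBiasDropoutResidual and FusedLayerNorm above are custom fused ops that are not part of this snippet. As a rough illustration of the interface FusedBertOutput assumes, here is a minimal torch.autograd.Function sketch of FusedBiasDropoutResidual; a real version would dispatch to a fused CUDA or TorchScript kernel rather than composing eager PyTorch ops, and would read the dropout probability from the model config instead of the default used here:

import torch

class FusedBiasDropoutResidual(torch.autograd.Function):
    # Sketch only: computes dropout(hidden_states + bias) + residual as one op.
    @staticmethod
    def forward(ctx, hidden_states, bias, residual, p=0.1, training=True):
        out = hidden_states + bias
        if training and p > 0.0:
            # scaled dropout mask, saved for the backward pass
            mask = torch.empty_like(out).bernoulli_(1.0 - p).div_(1.0 - p)
        else:
            mask = torch.ones_like(out)
        ctx.save_for_backward(mask)
        return out * mask + residual

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        grad_hidden = grad_output * mask
        # bias gradient is reduced over every dim except the last (hidden) dim
        grad_bias = grad_hidden.sum(dim=tuple(range(grad_hidden.dim() - 1)))
        return grad_hidden, grad_bias, grad_output, None, None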
When the user calls the fuse_modules() method, the kernel fusion engine finds BertOutput and replaces it with FusedBertOutput. Likewise, when the user calls the fuse_layers() method, the engine finds BertLayer and replaces it with FusedBertLayer. This is the method that parallelformers uses to parallelize transformers models flexibly, and deepspeed also supports kernel fusion in this way.
However, the current version of deepspeed fuses the entire transformer layer, so the supported models are very limited. For example, BigBird requires a random attention mechanism, and in that case the random attention must be implemented in a custom CUDA kernel. Because the number of models is so large, it is impossible to implement them all. So I propose a flexible way to fuse kernels on a per-function basis. This is a triage strategy: the areas that can be fused use the fused kernels, and the areas that cannot be fused fall back to torch's default modules.
# kernel_fusion_utils.py
class KernelFusionMixin(object):
    def fuse_modules(self, ...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...

    def fuse_layers(self, ...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...


# modeling_utils.py
class PreTrainedModel(..., KernelFusionMixin):
    _is_parallelizable = ...
    _is_able_to_fuse = False  # <--- Only models that can be fused have `True`.
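As a usage sketch under this draft API (the flag and method names follow the pseudocode above; whether BertForMaskedLM would actually opt in, and the getattr guard, are my own assumptions):

# a fusable model class would opt in by overriding the flag, e.g.:
#     class BertForMaskedLM(BertPreTrainedModel):
#         _is_able_to_fuse = True
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-cased")
if getattr(model, "_is_able_to_fuse", False):
    model.fuse_modules()               # function-level fusion
    model.fuse_layers(inference=True)  # block-level fusion for inference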
This is a draft. The API can change at any time. I look forward to feedback. I'm going to show you this soon with a framework I'm making. (Like parallelformers, we will pre-open the repositories on our side and merge them into transformers and deepspeed later.)
Review of Fused Kernels for transformer
If you find other fused kernels, please let me know here. I’ll test and record them. 😃
1. Module-level Kernels
Module-level kernels are fused kernels for independent operation sets like scale + mask + softmax or bias + dropout + residual. This is in contrast to layer-level kernels, which fuse the entire transformer layer. Note that all kernels must have both forward and backward implementations when they are used for training.

FusedScaleMaskSoftmax (from Megatron-LM)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_softmax.py
This kernel fuses scale + masking + softmax for transformer attention. We tested this kernel, and it performs better than the original HuggingFace Transformers attention method. Note that there are some constraints to turn this kernel on, but some of the constraints defined in Megatron-LM aren't correct, so we modified them. The left image shows the case where the constraints are satisfied, and the right one shows the case where they are not (in that case, the performance is the same as the non-fused method).
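For reference, the unfused computation this kernel replaces is roughly the following (a plain TorchScript sketch for illustration only, not the Megatron-LM kernel itself; the -10000.0 masking value mirrors the usual HuggingFace convention):

import torch

@torch.jit.script
def scale_mask_softmax(scores: torch.Tensor, mask: torch.Tensor, scale: float) -> torch.Tensor:
    # scores: [batch, heads, seq_q, seq_k]; mask: boolean, True marks positions to hide
    scores = scores * scale
    scores = scores.masked_fill(mask, -10000.0)
    return torch.softmax(scores, dim=-1)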
FusedLayerNorm (from Megatron-LM and NVIDIA Apex)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_layer_norm.py
This kernel fuses all the operations of layer normalization. However, when we tested it, it was slower than the original torch.nn.LayerNorm. Therefore, we have decided not to provide this kernel. See https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059.
FusedBiasActivation (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_bias_gelu.py
This kernel fuses bias addition + GeLU function. All activation functions are supported because the user can use any activation function, but the speedup occurs only with the GeLU function (the other activation functions work the same as before). We use GeLU Fast, which is faster than the original GeLU implementation because all the numerical constants are provided already computed.

FusedBiasDropout (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This kernel fuses bias addition + dropout.

FusedBiasDropoutResidual (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This kernel fuses bias addition + dropout + residual addition. The images above show the performance of FusedGPT2MLP, made by combining FusedBiasActivation and FusedBiasDropout. These results show that two fused kernels can lead to a significant performance improvement over the original GPT2MLP.
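To make the torch.jit.script approach concrete, here is a minimal sketch of the first two kernels in the style of Megatron-LM's scripted functions (the exact signatures in Megatron-LM differ; 0.7978845608 is sqrt(2/pi), the constant used by the tanh GeLU approximation):

import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # bias addition + "fast" tanh-approximated GeLU, fused into one scripted graph
    x = x + bias
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * x * (1.0 + 0.044715 * x * x)))

@torch.jit.script
def fused_bias_dropout(x: torch.Tensor, bias: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    # bias addition + dropout in one scripted graph
    return torch.nn.functional.dropout(x + bias, p=p, training=training)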
FusedSplitHeads & FusedMergeHeads (torch.jit.script)
We tried to JIT-compile the view + permute + contiguous operations performed to split or merge heads in the transformer attention layer, but there was no difference in speed. Probably because these are not elementwise operations, the performance improvement is negligible. Therefore, we have decided not to provide this kernel.

FusedCrossEntropy (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/cross_entropy_layer.py
This kernel fuses log_softmax + nll_loss. However, when we tested it, it was about 2 ~ 3 times slower than the original torch.nn.CrossEntropyLoss. Therefore, we have decided not to provide this kernel. See https://github.com/bytedance/lightseq/issues/204.

FusedEmbedding (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/transformer_embedding_layer.py
This kernel fuses positional embedding + word embedding. We were very interested in this kernel, but unfortunately the positional embedding only supports the sinusoidal method, which almost no model uses today. Therefore, we have decided not to support this kernel.

FusedNoRepeatNGramLogitsProcessor (from fastseq)
https://github.com/microsoft/fastseq/blob/main/fastseq/ops/ngram_repeat_block.py
This kernel performs no repeat ngram blocking on the GPU when generating text. In our tests, there is no significant impact when the generated text is short, but it shows a very large performance improvement when the text is long. So we modified GenerationMixin to include this kernel. You will be able to use it via model.generate(..., no_repeat_ngram_size=n, fused_no_repeat_ngram_blocking=True) later.

I will also review layer-level kernels during this week. 😉
@stas00 We’ll be cutting a branch that works with PyTorch 1.11.0, and to be honest, I don’t think it’d be that hard to cut a release for 1.10.1 now either.
So, I think the issues with user setup are not that difficult to resolve.