[RFC] adding Tensor and Pipeline Parallelism to transformers
Following up on this proposal https://github.com/huggingface/transformers/issues/12772, I just had a discussion with @hyunwoongko (with great help from @JakeTae, who patiently translated for us), and we tried to work out a strategy for how to best integrate Tensor Parallelism (TP) and Pipeline Parallelism (PP) into `transformers`, making it easy for reviewers and contributors. Note that parallelformers currently implements only TP.
Here is a great example of how TP can be added, as @hyunwoongko has already implemented it in his fork for `GPTNeo`: https://github.com/tunib-ai/transformers/commit/5bf8655be624b3aeda799b80fddd220213491b04 (he didn't use `GPT2` since it already has the naive PP implemented). So you can see exactly what we want to merge. It's a very thin layer on top of the model, and most of the functionality lives in the helper parallel utils. The end of the change is multiple tests/examples that need to be converted to our test framework.
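To give a rough idea of what such a TP layer does under the hood, here is a minimal sketch (this is not the parallelformers code, and `ShardedLinear` is a made-up name): each rank keeps only a shard of a linear layer's weight and reassembles the full output with a collective. It assumes `torch.distributed` has already been initialized.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ShardedLinear(nn.Module):
    """Column-parallel linear: each rank owns out_features // world_size columns."""

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        # In a real model this shard would be sliced out of the pretrained weight.
        self.shard = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.shard(x)  # [batch, ..., out_features // world_size]
        # Reassemble the full output along the feature dimension. Real
        # implementations (Megatron-LM, parallelformers) use autograd-aware
        # collectives here so that the backward pass also works.
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```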
Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models, because they include many features that interfere with implementing PP. PP imposes two requirements:

- the model must be expressible as `nn.Sequential`, and
- inputs/outputs must be simple tensors with batch size as the first dimension.

So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.
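To make these two requirements concrete, here is a toy illustration (not any particular engine's code): pipeline stages are just consecutive slices of an `nn.Sequential`, and micro-batches come from chunking the input along dim 0. HF models take many keyword arguments and return `ModelOutput` objects, so they don't fit this pattern as-is.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

# Two "stages"; a real engine would place each one on its own device/rank.
stage0, stage1 = model[:2], model[2:]

def pipelined_forward(batch: torch.Tensor, num_microbatches: int = 4) -> torch.Tensor:
    outputs = []
    # Micro-batches are slices along the batch dimension (dim 0).
    for micro in batch.chunk(num_microbatches, dim=0):
        # In a real engine the two stages run concurrently on different
        # devices; here they run back to back just to show the data flow.
        outputs.append(stage1(stage0(micro)))
    return torch.cat(outputs, dim=0)

out = pipelined_forward(torch.randn(8, 1024))
```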
So my thinking is that perhaps we do it from the get-go: instead of integrating TP into the normal model, say `GPTNeo`, we fork it to, say, `GPTNeo3D` from the start and do all the work, including TP and PP, on that new model. Once everybody is happy we can rinse and repeat for other models.

I added 3D to `GPTNeo` to make `GPTNeo3D` (3D = DP/TP/PP). I'm not exactly sure about this particular name, nor attached to it; it's just something to start with.

Also, once TP is implemented in, say, `GPTNeo3D`, we can start replicating it to other models, because parallelformers has them all covered already. PP will be much harder, and we can work on it in parallel.
I wanted to check in with the team to see whether this approach resonates better than modifying the existing models.
Thank you!
Also see this blog post explaining parallelformers.
Additionally, see the main PyTorch parallelism discussion at https://github.com/pytorch/rfcs/pull/32
Hi there, we are developing auto 3D parallelism distributed software for transformers models.

We plan to integrate this software into the `transformers.Trainer` class and make it as easy to use as the original trainer. We use `torch.fx` and `transformers.utils.fx` for graph extraction, automatically partition the extracted graph, and use the pipeline runtime engine from DeepSpeed for pipeline parallelism. For tensor parallelism we use a config mapping to support `megatron.mpu.layers` in transformers models automatically.

A prototype version is finished now. Theoretically, any fx-traceable model could be run in 3D parallelism. More models are under testing, and we are working intensively to make this software more functional. We will open source it very soon.
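Roughly, the graph-extraction and partitioning step looks like the sketch below (an illustrative sketch only, not the actual prototype; a real partitioner also balances the stages and handles non-tensor values that cross stage boundaries):

```python
import torch
from torch.fx.passes.split_module import split_module
from transformers import GPT2LMHeadModel
from transformers.utils.fx import symbolic_trace

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Trace the model into a torch.fx GraphModule using the HF-aware tracer.
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask"])

# Toy partitioning policy: first half of the graph -> stage 0, second half -> stage 1.
nodes = list(traced.graph.nodes)
order = {node: i for i, node in enumerate(nodes)}
num_stages = 2

def stage_of(node: torch.fx.Node) -> int:
    return order[node] * num_stages // len(nodes)

# split_module rewrites the graph into one submodule per stage, which could then
# be handed to a pipeline runtime (e.g. DeepSpeed's pipeline engine).
stages = split_module(traced, model, stage_of)
print(stages)
```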
Any advice will be appreciated~
Yes, we have thought about an automated partitioning solution where you don't need to rewrite your model as `nn.Sequential`; you just pass in an `nn.Module` and the pipelining framework takes care of the rest. One potential idea was to extract the graph using `torch.fx`, inspect the graph, and appropriately partition it across devices.

cc @wanchaol Was wondering if the design we brainstormed is in a state where we can share it publicly in an RFC?
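As a minimal illustration of that idea (a sketch only, not the design being brainstormed): trace a plain `nn.Module` with `torch.fx`, walk the graph, and assign each submodule call to a device.

```python
import torch
import torch.fx
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(16, 16)
        self.b = nn.Linear(16, 16)

    def forward(self, x):
        return self.b(torch.relu(self.a(x)))

traced = torch.fx.symbolic_trace(Toy())

# Each call_module node is a candidate cut point for a pipeline stage; a real
# partitioner would also balance compute/memory and insert the send/recv ops.
devices = ["cuda:0", "cuda:1"]
module_nodes = [n for n in traced.graph.nodes if n.op == "call_module"]
for i, node in enumerate(module_nodes):
    device = devices[i * len(devices) // len(module_nodes)]
    print(f"{node.target} -> {device}")
```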