[RFC] adding Tensor and Pipeline Parallelism to transformers
Following up on this proposal https://github.com/huggingface/transformers/issues/12772, I just had a discussion with @hyunwoongko (with great help from @JakeTae, who patiently translated for us), and we tried to work out a strategy for how to best integrate Tensor Parallelism (TP) and Pipeline Parallelism (PP) into `transformers`, making it easy for reviewers and contributors. Note that parallelformers currently implements only TP.
Here is a great example of how TP can be added, as @hyunwoongko has already implemented it in his fork for `GPTNeo`: https://github.com/tunib-ai/transformers/commit/5bf8655be624b3aeda799b80fddd220213491b04 (he didn't use `GPT2` since it already has the naive PP implemented). So you can see exactly what we want to merge. It's a very thin layer on top of the model, and most of the functionality lives in the helper parallel utils. The end of the change is multiple tests/examples that need to be converted to our test framework.
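To give a rough idea of what such a TP layer does under the hood, here is a minimal sketch (this is not the parallelformers code, and `ShardedLinear` is a made-up name): each rank keeps only a shard of a linear layer's weight and reassembles the full output with a collective. It assumes `torch.distributed` has already been initialized.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ShardedLinear(nn.Module):
    """Column-parallel linear: each rank owns out_features // world_size columns."""

    def __init__(self, in_features: int, out_features: int, world_size: int):
        super().__init__()
        assert out_features % world_size == 0
        # In a real model this shard would be sliced out of the pretrained weight.
        self.shard = nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.shard(x)  # [batch, ..., out_features // world_size]
        # Reassemble the full output along the feature dimension. Real
        # implementations (Megatron-LM, parallelformers) use autograd-aware
        # collectives here so that the backward pass also works.
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)
        return torch.cat(gathered, dim=-1)
```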
Now, while adding TP is relatively easy, adding PP is very complex in the current state of HF models, because they include many features that interfere with implementing PP. PP imposes two requirements:

- the model must be expressible as `nn.Sequential`, and
- inputs/outputs must be simple tensors with batch size as the first dimension.

So to implement PP we will most likely have to fork each model, strip the features that are unnecessary for scalability, and only then be able to implement PP.
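To make these two requirements concrete, here is a toy illustration (not any particular engine's code): pipeline stages are just consecutive slices of an `nn.Sequential`, and micro-batches come from chunking the input along dim 0. HF models take many keyword arguments and return `ModelOutput` objects, so they don't fit this pattern as-is.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

# Two "stages"; a real engine would place each one on its own device/rank.
stage0, stage1 = model[:2], model[2:]

def pipelined_forward(batch: torch.Tensor, num_microbatches: int = 4) -> torch.Tensor:
    outputs = []
    # Micro-batches are slices along the batch dimension (dim 0).
    for micro in batch.chunk(num_microbatches, dim=0):
        # In a real engine the two stages run concurrently on different
        # devices; here they run back to back just to show the data flow.
        outputs.append(stage1(stage0(micro)))
    return torch.cat(outputs, dim=0)

out = pipelined_forward(torch.randn(8, 1024))
```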
So my thinking is that perhaps we do it from the get-go: instead of integrating TP into the normal model, say `GPTNeo`, we fork it to, say, `GPTNeo3D` from the start and do all the work, including TP and PP, on that new model. Once everybody is happy we can rinse and repeat for other models.

I added 3D to `GPTNeo` to make `GPTNeo3D` (3D = DP/TP/PP). I'm not exactly sure about this particular name, nor attached to it; it's just something to start with.

Also, once TP is implemented in, say, `GPTNeo3D`, we can start replicating it to other models, because parallelformers has them all covered already. PP will be much harder, and we can work on it in parallel.
I wanted to check in with the team to see whether this approach resonates better than modifying the existing models.
Thank you!
Also see this blog post explaining parallelformers.
Additionally, see the main PyTorch parallelism discussion at https://github.com/pytorch/rfcs/pull/32
Hi there, we are developing auto 3D parallelism distributed software for transformers models.

We plan to integrate this software into the `transformers.Trainer` class and make it as easy to use as the original trainer. We use `torch.fx` and `transformers.utils.fx` for graph extraction, automatically partition the extracted graph, and use the pipeline runtime engine from DeepSpeed for pipeline parallelism. For tensor parallelism we use a config mapping to support `megatron.mpu.layers` in transformers models automatically.

A prototype version is finished now. Theoretically, any fx-traceable model could be run in 3D parallelism. More models are under testing, and we are working intensively to make this software more functional. We will open source it very soon.
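Roughly, the graph-extraction and partitioning step looks like the sketch below (an illustrative sketch only, not the actual prototype; a real partitioner also balances the stages and handles non-tensor values that cross stage boundaries):

```python
import torch
from torch.fx.passes.split_module import split_module
from transformers import GPT2LMHeadModel
from transformers.utils.fx import symbolic_trace

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Trace the model into a torch.fx GraphModule using the HF-aware tracer.
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask"])

# Toy partitioning policy: first half of the graph -> stage 0, second half -> stage 1.
nodes = list(traced.graph.nodes)
order = {node: i for i, node in enumerate(nodes)}
num_stages = 2

def stage_of(node: torch.fx.Node) -> int:
    return order[node] * num_stages // len(nodes)

# split_module rewrites the graph into one submodule per stage, which could then
# be handed to a pipeline runtime (e.g. DeepSpeed's pipeline engine).
stages = split_module(traced, model, stage_of)
print(stages)
```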
Any advice will be appreciated~
Yes, we have thought about an automated partitioning solution where you don't need to rewrite your model as `nn.Sequential`; you just pass in an `nn.Module` and the pipelining framework takes care of the rest. One potential idea was to extract the graph using `torch.fx`, inspect the graph, and appropriately partition it across devices.

cc @wanchaol Was wondering if the design we brainstormed is in a state where we can share it publicly in an RFC?
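As a minimal illustration of that idea (a sketch only, not the design being brainstormed): trace a plain `nn.Module` with `torch.fx`, walk the graph, and assign each submodule call to a device.

```python
import torch
import torch.fx
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(16, 16)
        self.b = nn.Linear(16, 16)

    def forward(self, x):
        return self.b(torch.relu(self.a(x)))

traced = torch.fx.symbolic_trace(Toy())

# Each call_module node is a candidate cut point for a pipeline stage; a real
# partitioner would also balance compute/memory and insert the send/recv ops.
devices = ["cuda:0", "cuda:1"]
module_nodes = [n for n in traced.graph.nodes if n.op == "call_module"]
for i, node in enumerate(module_nodes):
    device = devices[i * len(devices) // len(module_nodes)]
    print(f"{node.target} -> {device}")
```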