model parallelism for BART
For @stas00:
High-level goal: allow large Seq2Seq transformers (many of which inherit from BART) to be run across multiple GPUs with model parallelism.
- This is a prerequisite for adding m2m100, and can be done in the same or a separate PR. I prefer separate, given all the boilerplate associated with new model additions.
- This has been attempted for GPT-2, and is days to weeks away from merging: #7772.
- fairseq has a different scheme, as shown by the last 4 command-line args in this command.
- That model requires more hardware than you have locally, which is another good reason to try a 2-GPU case first. As such, the test you should try to get passing is roughly:
```python
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
batch = tokenizer(['I am a small frog'], return_tensors='pt')

n = 20  # set so that model.cuda() OOMs on 1 GPU (in a few lines)
cfg = BartConfig(encoder_layers=n, decoder_layers=n)
model = BartForConditionalGeneration(cfg)
model.cuda()
model.generate(**batch.to('cuda'))  # should OOM here or the line before

# this device_map is taken from #7772, feel free to make your own signature
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],
              1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
              2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
              3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
model.parallelize(device_map)  # I might call this model.split
model.generate(**batch.to('cuda'))  # should no longer OOM

model.save_pretrained('parallelized_model')
model = BartForConditionalGeneration.from_pretrained('parallelized_model')  # should be parallelized
model.deparallelize()  # puts the model back on CPU and calls torch.cuda.empty_cache() to liberate GPU memory
```
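The `device_map` above is written out by hand; if you prefer a different signature, a small helper that splits layer indices evenly across GPUs is one option. The `split_evenly` below is purely hypothetical (it does not exist in transformers) and is only meant to illustrate the idea:

```python
import torch

def split_evenly(num_layers, gpu_ids=None):
    """Hypothetical helper: spread `num_layers` layer indices across GPUs as evenly as possible."""
    if gpu_ids is None:
        gpu_ids = list(range(torch.cuda.device_count()))
    per_gpu, remainder = divmod(num_layers, len(gpu_ids))
    device_map, start = {}, 0
    for i, gpu in enumerate(gpu_ids):
        count = per_gpu + (1 if i < remainder else 0)
        device_map[gpu] = list(range(start, start + count))
        start += count
    return device_map

# split_evenly(48, [0, 1, 2, 3])
# -> {0: [0, ..., 11], 1: [12, ..., 23], 2: [24, ..., 35], 3: [36, ..., 47]}
```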
Requirements
- user can `save_pretrained`/`from_pretrained` without losing the partitioning of layers -> devices.
- some forward/generate call that would OOM on a single GPU does not OOM after calling `.parallelize()` (a rough sketch of one possible `parallelize` follows this list).
- user can repartition by loading the full model to CPU and calling `model.parallelize(device_map)` again with a new map.
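For concreteness, here is a very rough sketch of what `parallelize`/`deparallelize` could look like on BART, following the layer-to-device approach of #7772. The attribute names (`model.encoder.layers`, `model.decoder.layers`, `model.shared`, `lm_head`) match the current BART module layout; everything else, including the class itself, is an assumption rather than a proposed final API:

```python
import torch
from transformers import BartForConditionalGeneration

class ParallelizableBart(BartForConditionalGeneration):  # hypothetical, for illustration only
    def parallelize(self, device_map):
        # device_map: {gpu_id: [layer indices]}, indexing encoder layers first, then decoder layers
        self.device_map = device_map
        layers = list(self.model.encoder.layers) + list(self.model.decoder.layers)
        for gpu_id, layer_ids in device_map.items():
            for idx in layer_ids:
                layers[idx].to(f"cuda:{gpu_id}")
        # one possible choice: embeddings on the first device, lm_head on the last
        # (lm_head is weight-tied to the shared embeddings, so a real implementation needs more care here)
        first, last = min(device_map), max(device_map)
        self.model.shared.to(f"cuda:{first}")
        self.lm_head.to(f"cuda:{last}")
        self.model_parallel = True

    def deparallelize(self):
        self.to("cpu")
        self.model_parallel = False
        torch.cuda.empty_cache()  # liberate the GPU memory, as in the GPT-2 PR
```

The hard part (and where most of the work in #7772 lives) is the forward pass: hidden states, attention masks, and caches have to be moved to each layer's device as they flow through. Per the requirements above, the device_map also has to survive a `save_pretrained`/`from_pretrained` round trip, e.g. by storing it in the config.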
Brain Dump
- You should read the whole document https://github.com/pytorch/fairseq/tree/master/examples/m2m_100#beyond-english-centric-multilingual-machine-translation before starting.
- I would also read the discussion and code for https://github.com/huggingface/transformers/pull/7772
- you have a lot of flexibility in naming/API and should feel empowered to make choices as you see fit.
- To the extent possible, keep the PR small and avoid interfering with existing single-GPU functionality.
- You could add a fairscale dependency during your experiments/local dev, but it would be a battle to get fairscale added as a dependency. If that is a worthwhile battle, however, you should argue for it (a rough sketch of such an experiment follows this list).
- I suspect that this will take nearly as long as FSMT, but be much less code.
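If you do experiment with fairscale locally, its `Pipe` wrapper (a GPipe-style pipeline) is probably the relevant piece. The sketch below assumes the `fairscale.nn.Pipe(module, balance=..., chunks=...)` interface and default device placement from the fairscale releases of that era; it only shows the shape of the experiment on a toy `nn.Sequential`, since a real BART forward is not purely sequential:

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe  # experimental, local-dev only dependency

# Toy stand-in for a stack of layers; Pipe expects an nn.Sequential.
blocks = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
)

# balance: how many sequential modules live on each GPU (3 on cuda:0, 2 on cuda:1);
# chunks: number of micro-batches per step so both GPUs stay busy.
pipe = Pipe(blocks, balance=[3, 2], chunks=4)
out = pipe(torch.randn(8, 1024).cuda(0))  # input goes to the first partition's device
```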
What do you think?
When you don't use the HF Trainer, you're on your own, as you're outside of the domain of HF Transformers. For non-HF-Trainer use we only provide a way for `from_pretrained` to load the model directly onto multiple GPUs via `zero.Init`, and that's what the link you added points to. Basically you have to study the DeepSpeed documentation https://www.deepspeed.ai/ and follow it.
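For reference, the non-Trainer pattern referred to above looks roughly like the following. The import path of `HfDeepSpeedConfig` has moved between transformers versions (`transformers.deepspeed` in older releases, `transformers.integrations` in newer ones) and the config is trimmed down, so treat this as a sketch and defer to the linked docs:

```python
import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig  # `transformers.deepspeed` in older versions

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
}

# Must be created (and kept alive) *before* from_pretrained, so that
# from_pretrained runs under deepspeed.zero.Init() and shards the weights
# across GPUs instead of materializing the whole model on one device.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```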
This line of work has been abandoned as it's highly inefficient. Please use DeepSpeed, which works with any model: https://huggingface.co/docs/transformers/main/main_classes/deepspeed
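With the Trainer, the DeepSpeed integration is just a config file passed through the `deepspeed` training argument (or `--deepspeed` on the command line). A minimal sketch, with the full set of recommended options in the docs linked above:

```python
from transformers import TrainingArguments

# ds_config.json would contain at least a ZeRO section, e.g.
# {"zero_optimization": {"stage": 3},
#  "train_micro_batch_size_per_gpu": "auto",
#  "fp16": {"enabled": "auto"}}
args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # path to the DeepSpeed config file
)
```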