
model parallelism for BART

See original GitHub issue

For @stas00:

High Level Goal: allow large Seq2Seq transformers (many of which inherit from BART) to be run across multiple GPUs with model parallelism.

  • This is a prerequisite for adding m2m100, and can be done in the same or a separate PR. I prefer a separate one, given all the boilerplate associated with new model additions.
  • This has been attempted for GPT-2, and is days-to-weeks away from merging: #7772.
  • fairseq has a different scheme, as shown by the last 4 command-line args in this command.
  • That model requires more hardware than you have locally, which is another good reason to try a 2 GPU case first before scaling up. As such, the test you should try to get passing is roughly:
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
batch = tokenizer(['I am a small frog'], return_tensors='pt').to('cuda:0')

n = 20  # set so that model.cuda() OOMs on 1 GPU (in a few lines)
cfg = BartConfig(encoder_layers=n, decoder_layers=n)
model = BartForConditionalGeneration(cfg)

model.cuda()
model.generate(**batch)  # should OOM here or the line before

# this device_map is taken from #7772; it indexes 48 layers over 4 GPUs, so
# adjust it to the 2*n layers used here. Feel free to make your own signature.
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],
              1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
              2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
              3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
model.parallelize(device_map)  # i might call this model.split
model.generate(**batch)  # should not OOM once the layers are spread across GPUs

model.save_pretrained('parallelized_model')
model = BartForConditionalGeneration.from_pretrained('parallelized_model')  # should come back parallelized
model.deparallelize()  # puts the model back on cpu and calls torch.cuda.empty_cache() to liberate GPU memory

Requirements

  • user can save_pretrained/from_pretrained without losing the partitioning of layers -> devices.
  • some forward/generate call that would OOM on a single GPU does not OOM after calling .parallelize (a rough sketch of one way to do this follows this list).
  • user can repartition by loading the full model back to cpu, then calling model.parallelize(device_map) with a new map.
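
To make the requirements concrete, here is a rough, hypothetical sketch of what a naive "vertical" parallelize(device_map) could do for a BART-style model: keep the small non-layer pieces on the first GPU, place each encoder/decoder layer on its mapped GPU, and use forward pre-hooks to carry tensors across device boundaries. parallelize_bart and _to_device_hook are made-up names, not the transformers API; the hook approach assumes torch >= 2.0 (with_kwargs=True) and has not been exercised against every generation path.

import torch
from transformers import BartForConditionalGeneration

def _to_device_hook(device):
    # Forward pre-hook that moves every tensor argument (positional or keyword)
    # onto `device` before the wrapped module runs.
    def hook(module, args, kwargs):
        args = tuple(a.to(device) if torch.is_tensor(a) else a for a in args)
        kwargs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in kwargs.items()}
        return args, kwargs
    return hook

def parallelize_bart(model, device_map):
    # device_map: {gpu_id: [layer_idx, ...]} where the flat indices run over the
    # encoder layers first, then the decoder layers (same scheme as the GPT-2 PR).
    first = f"cuda:{min(device_map)}"

    # The small non-layer pieces (embeddings, embedding layer norms, LM head)
    # live on the first GPU; the layer stacks go straight from CPU onto their
    # target GPUs, so the full model never has to fit on one device.
    enc, dec = model.model.encoder, model.model.decoder
    for module in (model.model.shared, enc.embed_tokens, enc.embed_positions,
                   enc.layernorm_embedding, dec.embed_tokens, dec.embed_positions,
                   dec.layernorm_embedding, model.lm_head):
        module.to(first)
    model.final_logits_bias = model.final_logits_bias.to(first)

    layers = list(enc.layers) + list(dec.layers)
    for gpu, layer_ids in device_map.items():
        device = f"cuda:{gpu}"
        for idx in layer_ids:
            layers[idx].to(device)
            layers[idx].register_forward_pre_hook(_to_device_hook(device), with_kwargs=True)

    # Bring the last decoder layer's output back to the first GPU for the LM head.
    model.lm_head.register_forward_pre_hook(_to_device_hook(first), with_kwargs=True)
    return model

# usage sketch: parallelize_bart(model, device_map); model.generate(**batch.to('cuda:0'))

A real implementation along the lines of #7772 would more likely record the map on the model and do the device moves explicitly inside forward, and would also need deparallelize plus save_pretrained/from_pretrained support to satisfy the requirements above.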

Brain Dump

What do you think?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Jun 5, 2022

When you don’t use the HF Trainer you’re on your own, since you’re outside the domain of HF Transformers. For non-HF-Trainer use we only provide a way for from_pretrained to load the model directly onto multiple GPUs via zero.Init, and that’s what the link you added points to.

Basically you have to study the DeepSpeed documentation (https://www.deepspeed.ai/) and follow it.
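
For reference, the non-Trainer path described here is the one in the HF DeepSpeed integration guide; a minimal sketch, assuming a recent transformers/deepspeed pair, looks roughly like the following. The ZeRO-3 config is a trimmed placeholder, and in older transformers versions HfDeepSpeedConfig is imported from transformers.deepspeed rather than transformers.integrations.

import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig

# Trimmed placeholder config; the full set of keys is in the linked docs.
ds_config = {
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) *before* from_pretrained so that ZeRO-3's
# zero.Init is active while the model is constructed, i.e. the weights are
# sharded across GPUs at load time instead of materializing on one device.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # ds_engine.module is the sharded model, ready for generate()

# launch with: deepspeed --num_gpus 2 this_script.py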

1 reaction
stas00 commented, Jun 5, 2022

This line of work has been abandoned as it’s highly inefficient. Please use DeepSpeed, which works with any model: https://huggingface.co/docs/transformers/main/main_classes/deepspeed
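
The Trainer route from the linked docs amounts to passing a DeepSpeed config to TrainingArguments and launching with the deepspeed launcher. A hedged minimal sketch (the dataset, output_dir and ds_config_zero3.json file are placeholders):

from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

args = TrainingArguments(
    output_dir="bart-deepspeed",       # placeholder
    per_device_train_batch_size=1,
    deepspeed="ds_config_zero3.json",  # ZeRO-3 shards params/grads/optimizer state across GPUs
)

trainer = Trainer(model=model, args=args, train_dataset=my_tokenized_dataset)  # placeholder dataset
trainer.train()

# launch with: deepspeed --num_gpus=4 train_bart.py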

Read more comments on GitHub >

