
model parallelism for BART

See original GitHub issue

For @stas00:

High Level Goal: allow large Seq2Seq transformers (many of which inherit from BART) to be run across multiple GPUs with model parallelism.

  • This is a prerequisite for adding m2m100, and can be done in the same or a separate PR. I prefer a separate one, given all the boilerplate associated with new model additions.
  • This has been attempted for GPT-2, and is days-to-weeks away from merging: #7772.
  • fairseq has a different scheme, as shown by the last 4 command-line args in this command.
  • That model requires more hardware than you have locally, which is another good reason to try a 2 GPU case first before scaling up. As such, the test you should try to get passing is roughly:
from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
batch = tokenizer(['I am a small frog'], return_tensors='pt').to('cuda:0')

n = 20  # set so that model.cuda() OOMs on 1 GPU (in a few lines)
cfg = BartConfig(encoder_layers=n, decoder_layers=n)
model = BartForConditionalGeneration(cfg)

model.cuda()
model.generate(**batch)  # should OOM here or the line before

# this device_map is taken from #7772; it indexes 48 layers over 4 GPUs, so
# adjust it to the 2*n layers used here. Feel free to make your own signature.
device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],
              1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
              2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],
              3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]}
model.parallelize(device_map)  # i might call this model.split
model.generate(**batch)  # should not OOM once the layers are spread across GPUs

model.save_pretrained('parallelized_model')
model = BartForConditionalGeneration.from_pretrained('parallelized_model')  # should come back parallelized
model.deparallelize()  # puts the model back on cpu and calls torch.cuda.empty_cache() to liberate GPU memory

Requirements

  • user can save_pretrained/from_pretrained without losing the partitioning of layers -> devices.
  • some forward/generate call that would OOM on a single GPU does not OOM after calling .parallelize (a rough sketch of one way to do this follows this list).
  • user can repartition by loading the full model back to cpu, then calling model.parallelize(device_map) with a new map.
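
To make the requirements concrete, here is a rough, hypothetical sketch of what a naive "vertical" parallelize(device_map) could do for a BART-style model: keep the small non-layer pieces on the first GPU, place each encoder/decoder layer on its mapped GPU, and use forward pre-hooks to carry tensors across device boundaries. parallelize_bart and _to_device_hook are made-up names, not the transformers API; the hook approach assumes torch >= 2.0 (with_kwargs=True) and has not been exercised against every generation path.

import torch
from transformers import BartForConditionalGeneration

def _to_device_hook(device):
    # Forward pre-hook that moves every tensor argument (positional or keyword)
    # onto `device` before the wrapped module runs.
    def hook(module, args, kwargs):
        args = tuple(a.to(device) if torch.is_tensor(a) else a for a in args)
        kwargs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in kwargs.items()}
        return args, kwargs
    return hook

def parallelize_bart(model, device_map):
    # device_map: {gpu_id: [layer_idx, ...]} where the flat indices run over the
    # encoder layers first, then the decoder layers (same scheme as the GPT-2 PR).
    first = f"cuda:{min(device_map)}"

    # The small non-layer pieces (embeddings, embedding layer norms, LM head)
    # live on the first GPU; the layer stacks go straight from CPU onto their
    # target GPUs, so the full model never has to fit on one device.
    enc, dec = model.model.encoder, model.model.decoder
    for module in (model.model.shared, enc.embed_tokens, enc.embed_positions,
                   enc.layernorm_embedding, dec.embed_tokens, dec.embed_positions,
                   dec.layernorm_embedding, model.lm_head):
        module.to(first)
    model.final_logits_bias = model.final_logits_bias.to(first)

    layers = list(enc.layers) + list(dec.layers)
    for gpu, layer_ids in device_map.items():
        device = f"cuda:{gpu}"
        for idx in layer_ids:
            layers[idx].to(device)
            layers[idx].register_forward_pre_hook(_to_device_hook(device), with_kwargs=True)

    # Bring the last decoder layer's output back to the first GPU for the LM head.
    model.lm_head.register_forward_pre_hook(_to_device_hook(first), with_kwargs=True)
    return model

# usage sketch: parallelize_bart(model, device_map); model.generate(**batch.to('cuda:0'))

A real implementation along the lines of #7772 would more likely record the map on the model and do the device moves explicitly inside forward, and would also need deparallelize plus save_pretrained/from_pretrained support to satisfy the requirements above.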

Brain Dump

What do you think?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Jun 5, 2022

When you don’t use the HF Trainer you’re on your own, since you’re outside the domain of HF Transformers. For non-HF-Trainer use we only provide a way for from_pretrained to load the model directly onto multiple GPUs via zero.Init, and that’s what the link you added points to.

Basically you have to study the DeepSpeed documentation (https://www.deepspeed.ai/) and follow it.
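
For reference, the non-Trainer path described here is the one in the HF DeepSpeed integration guide; a minimal sketch, assuming a recent transformers/deepspeed pair, looks roughly like the following. The ZeRO-3 config is a trimmed placeholder, and in older transformers versions HfDeepSpeedConfig is imported from transformers.deepspeed rather than transformers.integrations.

import deepspeed
from transformers import AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig

# Trimmed placeholder config; the full set of keys is in the linked docs.
ds_config = {
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created (and kept alive) *before* from_pretrained so that ZeRO-3's
# zero.Init is active while the model is constructed, i.e. the weights are
# sharded across GPUs at load time instead of materializing on one device.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # ds_engine.module is the sharded model, ready for generate()

# launch with: deepspeed --num_gpus 2 this_script.py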

1 reaction
stas00 commented, Jun 5, 2022

This line of work has been abandoned as it’s highly inefficient. Please use DeepSpeed, which works with any model: https://huggingface.co/docs/transformers/main/main_classes/deepspeed
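
The Trainer route from the linked docs amounts to passing a DeepSpeed config to TrainingArguments and launching with the deepspeed launcher. A hedged minimal sketch (the dataset, output_dir and ds_config_zero3.json file are placeholders):

from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

args = TrainingArguments(
    output_dir="bart-deepspeed",       # placeholder
    per_device_train_batch_size=1,
    deepspeed="ds_config_zero3.json",  # ZeRO-3 shards params/grads/optimizer state across GPUs
)

trainer = Trainer(model=model, args=args, train_dataset=my_tokenized_dataset)  # placeholder dataset
trainer.train()

# launch with: deepspeed --num_gpus=4 train_bart.py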

Read more comments on GitHub >

