
Passing multiple models with DeepSpeed will fail

See original GitHub issue

My accelerate config

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 0
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
How many processes in total will you use? [1]: 1
Do you wish to use FP16 (mixed precision)? [yes/NO]: NO

Environment Info

Machine Info: V100 x 1
accelerate version: 0.5.1

(semi-)reproducible code

import torch
from accelerate import Accelerator

accelerator = Accelerator()

model1 = torch.nn.Transformer()
model2 = torch.nn.Transformer()
opt = torch.optim.Adam(...)   # optimizer (parameters elided in the original report)
loader = ...                  # DataLoader (elided in the original report)

# With DeepSpeed enabled, both returned models end up wrapping the same engine
model1, model2, opt, loader = accelerator.prepare(model1, model2, opt, loader)

Additional Explanation

When DeepSpeed is enabled, passing multiple models to prepare will fail: all of the returned models end up being the same as the last one passed. This is due to how _prepare_deepspeed handles its arguments, in particular:

for obj in result:
    if isinstance(obj, torch.nn.Module):
        model = obj
    elif isinstance(obj, (torch.optim.Optimizer, dict)):
        optimizer = obj

Written this way, model keeps only the last nn.Module that was passed in, so the DeepSpeed engine created later on is built around the wrong model.
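
To make the failure mode concrete, here is a minimal, self-contained sketch that mirrors the quoted loop; the models and variable names below are made up for illustration and are not part of the accelerate source:

import torch

# Two distinct models plus an optimizer, in the order they would reach prepare()
net_a = torch.nn.Linear(8, 8)
net_b = torch.nn.Linear(4, 4)
result = [net_a, net_b, torch.optim.Adam(net_b.parameters())]

model, optimizer = None, None
for obj in result:
    if isinstance(obj, torch.nn.Module):
        model = obj                # overwritten on every nn.Module it encounters
    elif isinstance(obj, (torch.optim.Optimizer, dict)):
        optimizer = obj

print(model is net_b)  # True: only the last module survives; net_a is silently dropped,
                       # so the DeepSpeed engine would be built around net_b only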

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
vanakema commented, Sep 21, 2022

Any update on this? I’m trying to train stable diffusion using the Dreambooth technique, and I was hoping to use Deepspeed to enable me to train on my RTX 3090 as I’m running out of memory. I found the issue was that the StableDiffusionPipeline is not a torch.nn.Module because it’s composed of multiple models. I thought a solution might be to call accelerator.prepare() multiple times, once for each of the models that we’re training, but I feel like that has the possibility to create multiple Deepspeed instances that wouldn’t work in concert with each other. Any advice for folks in my situation?

0 reactions
pacman100 commented, Sep 21, 2022

Hello @vanakema, we looked into support for multiple models with DeepSpeed but it wasn’t possible for the following reasons:

  1. The user only provides a single DeepSpeed config plugin/DeepSpeed config file, which corresponds to a single model. Ideally, the user would need a different DeepSpeed config for each model, and that is a niche scenario (see the sketch after this list).
  2. DeepSpeed needs to keep track of the model, its optimizer and its scheduler, so there can be only one global DeepSpeed engine wrapper to control the backward and optimizer/scheduler steps.
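
For context, a minimal sketch of what that single-config constraint looks like in code; the specific DeepSpeedPlugin arguments are illustrative and assume a recent accelerate release:

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# One DeepSpeed config (plugin or json file) is attached to the Accelerator itself,
# so every model passed to prepare() would have to share it.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)
# There is no way to bind a second, different DeepSpeed config to a second model.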

For this scenario, could you please try out the PyTorch FSDP integration? It should follow the format below.
Notes:

  1. Same FSDP config would be applicable to both models.
  2. Pass the optimizers to the prepare call in the same order as their corresponding models. This is because, when using accelerator.save_state, each optimizer needs to be matched with its corresponding model (see the sketch after the code below).
# Prepare all models before creating the optimizers: FSDP flattens the model parameters,
# which breaks an already created optimizer, and it is more efficient to prepare the model first anyway.
model1 = MyCoolModel1()
model2 = MyCoolModel2()
model1, model2 = accelerator.prepare(model1, model2)

opt1 = torch.optim.AdamW(model1.parameters(), lr1)
opt2 = torch.optim.AdamW(model2.parameters(), lr1)
scheduler1 = get_linear_schedule_with_warmup(opt1)
scheduler2 = get_linear_schedule_with_warmup(opt2)
opt1, opt2, scheduler1, scheduler2 = accelerator.prepare(opt1, opt2, scheduler1, scheduler2)

# remaining training loop and eval loop
...
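
To make note 2 concrete, here is a hedged sketch of how such a script might end; the dataloader, loss, and checkpoint directory are placeholders, and accelerator is assumed to be the same FSDP-configured Accelerator used above:

for batch in train_dataloader:  # hypothetical prepared dataloader
    loss = model1(batch).mean() + model2(batch).mean()  # placeholder loss
    accelerator.backward(loss)
    opt1.step(); opt2.step()
    scheduler1.step(); scheduler2.step()
    opt1.zero_grad(); opt2.zero_grad()

# Because opt1/opt2 were passed to prepare() in the same order as model1/model2,
# accelerator.save_state can pair each optimizer with its FSDP-wrapped model.
accelerator.save_state("my_checkpoint_dir")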