Passing multiple models with DeepSpeed will fail
See original GitHub issue

My accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 0
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]: yes
How many processes in total will you use? [1]: 1
Do you wish to use FP16 (mixed precision)? [yes/NO]: NO
Environment Info
Machine Info: V100 x 1
accelerate version: 0.5.1
(semi-)reproducible code
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # DeepSpeed enabled via the config above
model1 = torch.nn.Transformer()
model2 = torch.nn.Transformer()
opt = torch.optim.Adam(...)  # optimizer arguments elided in the original report
loader = ...                 # DataLoader elided in the original report
model1, model2, opt, loader = accelerator.prepare(model1, model2, opt, loader)
Additional Explanation
When using DeepSpeed, passing multiple models to prepare fails: all the returned models end up being the same as the last one passed.
This is due to how _prepare_deepspeed handles its arguments, in particular:
for obj in result:
    if isinstance(obj, torch.nn.Module):
        model = obj
    elif isinstance(obj, (torch.optim.Optimizer, dict)):
        optimizer = obj
This way, model ends up holding only the last nn.Module passed, so the engine object created later on wraps the wrong model.
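For illustration, a minimal check of the symptom, assuming the (semi-)reproducible snippet above has been run with the DeepSpeed config active (the variable names reuse that snippet; this sketch is not part of the original report):

# With the bug described above, both prepared "models" come back as the
# same DeepSpeed engine object, so the identity check prints True.
prepared1, prepared2, opt, loader = accelerator.prepare(model1, model2, opt, loader)
print(prepared1 is prepared2)  # expected: False; observed with DeepSpeed: True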
Top GitHub Comments
Any update on this? I'm trying to train Stable Diffusion using the DreamBooth technique, and I was hoping to use DeepSpeed to let me train on my RTX 3090, since I'm running out of memory. I found that the issue is that the StableDiffusionPipeline is not a torch.nn.Module, because it is composed of multiple models. I thought a solution might be to call accelerator.prepare() multiple times, once for each of the models we're training, but I feel like that could create multiple DeepSpeed instances that wouldn't work in concert with each other. Any advice for folks in my situation?

Hello @vanakema, we looked into support for multiple models with DeepSpeed, but it wasn't possible for the following reasons:
- The DeepSpeed engine ties the model together with its optimizer and scheduler, and the engine itself handles backward and the optimizer/scheduler step.
For this scenario, can you please try out the PyTorch FSDP integration? It should follow the format below.
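The exact snippet from the original comment is not reproduced here; the following is a rough sketch of what preparing multiple models with the Accelerate FSDP integration might look like (model and optimizer names are illustrative, and FSDP is assumed to be selected via accelerate config):

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP enabled through `accelerate config`

model1 = torch.nn.Transformer()
model2 = torch.nn.Transformer()

# With FSDP, prepare each model before building its optimizer so the
# optimizer is constructed over the wrapped (sharded) parameters.
model1 = accelerator.prepare(model1)
opt1 = torch.optim.Adam(model1.parameters())

model2 = accelerator.prepare(model2)
opt2 = torch.optim.Adam(model2.parameters())

opt1, opt2 = accelerator.prepare(opt1, opt2)

This keeps each optimizer paired with the model it was built from, which matters for the note below about accelerator.save_state.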
Notes:
- For accelerator.save_state, each optimizer requires its corresponding model.