Optimize model loading by natively using Accelerate, please.
Is your feature request related to a problem? Please describe.
Loading a pretrained model, for example UNet2DConditionModel, peaks at about 8 GB of RAM, even though the fully loaded model takes far less memory once it is on the GPU. That likely happens because the checkpoint weights and the freshly initialized model briefly coexist in memory, duplicating the parameters.
When we instead initialize the model with empty weights and load the checkpoint with Accelerate, the peak drops to 4.91 GB, which lets us deploy Diffusers on servers with much less RAM and therefore at lower cost.
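For reference, a minimal way to reproduce the baseline measurement (Linux-only, since it relies on `resource`; the model id is just an illustrative example):

```python
# Rough reproduction of the baseline peak-RAM observation.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
print(f"Peak RSS while loading: ~{peak_gb:.2f} GB")
```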
Describe the solution you’d like
It would be great if Diffusers natively supported loading pretrained models with `accelerate.init_empty_weights` and `accelerate.load_checkpoint_and_dispatch`, so that `from_pretrained` has a lower memory footprint. A rough sketch of what that could look like is below.
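A minimal sketch of the two-step meta-device flow from Accelerate applied to a Diffusers model (the config-loading helper and the local checkpoint path are illustrative and may differ between Diffusers versions):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from diffusers import UNet2DConditionModel

# Load only the config, then build the module on the "meta" device:
# no RAM is allocated for the parameters at this point.
config = UNet2DConditionModel.load_config(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
with init_empty_weights():
    unet = UNet2DConditionModel.from_config(config)

# Stream the checkpoint tensors directly onto the target device(s),
# so the pretrained weights never coexist with a redundant CPU copy.
unet = load_checkpoint_and_dispatch(
    unet,
    checkpoint="path/to/unet/diffusion_pytorch_model.bin",  # local weights file (placeholder path)
    device_map="auto",
)
```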
Describe alternatives you’ve considered
- Loading the models with Accelerate myself, but that requires model surgery and a lot of custom code, which makes the process error-prone
- Reimplementing what Accelerate does by hand
Additional context
Using Accelerate to natively load models would also reduce the RAM footprint when loading pipelines composed of more than one model, since each component could be allocated directly on its target device, again letting us deploy Diffusers on cheaper servers.
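To make the idea concrete, the user-facing API could look roughly like this; the `low_cpu_mem_usage` and `device_map` arguments here are part of the proposal, not necessarily what Diffusers exposes today:

```python
import torch
from diffusers import StableDiffusionPipeline

# Proposed: initialize every submodule with empty weights, then dispatch
# each checkpoint straight to its target device while loading.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,  # proposed flag: skip the redundant random init
    device_map="auto",       # proposed flag: place weights as they are loaded
)
```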
Top GitHub Comments
Happy to include `accelerate`! Maybe we can follow the same logic we have in `transformers` with `device_map="auto"` 😃 And definitely +1 to making use of torch’s meta device to halve the peak required memory usage!

Hey @piEsposito, happy to look into that in the coming week! If you have a code snippet that works for you, feel free to include it here as a reference 🤗
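For reference, a minimal sketch of the `transformers` behavior the comment refers to (requires `accelerate` to be installed; the model id is arbitrary):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" lets Accelerate place the weights as they are loaded,
# avoiding a full extra copy of the model in CPU RAM.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")
```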
@muellerzr @patrickvonplaten wdyt? We can either do this or bring over the low-RAM loading code from transformers.