Optimize model loading by natively using Accelerate, please.

Is your feature request related to a problem? Please describe.
Loading pretrained models is memory-hungry: UNet2DConditionModel, for example, peaks at about 8 GB of RAM during from_pretrained, while the fully loaded model takes much less memory once it sits on the GPU. This likely happens because the loaded state dict and the initialized model temporarily hold redundant copies of the weights.

When we instead instantiate the model with empty weights and load the checkpoint using Accelerate, memory peaks at 4.91 GB, which lets us deploy Diffusers on servers with much less RAM that are therefore cheaper.

Describe the solution you’d like
It would be great if Diffusers natively supported loading pretrained models using accelerate.init_empty_weights and accelerate.load_checkpoint_and_dispatch, so that from_pretrained has a lower memory footprint.
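For reference, a minimal sketch of the loading path this request is asking for, assuming a locally downloaded UNet folder (the path below is a placeholder, and the use of load_config/from_config to rebuild the architecture without weights is an assumption about the diffusers config API, not code from the issue):

```python
# Minimal sketch (not an existing diffusers code path): build the UNet on the
# meta device, then stream the checkpoint straight onto the target device.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from diffusers import UNet2DConditionModel

# Placeholder path to a locally downloaded UNet folder containing config.json
# and diffusion_pytorch_model.bin.
unet_dir = "path/to/unet"

# 1. Instantiate the model with empty (meta) weights: no RAM is allocated for parameters.
with init_empty_weights():
    config = UNet2DConditionModel.load_config(unet_dir)
    unet = UNet2DConditionModel.from_config(config)

# 2. Load the checkpoint and dispatch the weights directly onto the available
#    device(s), avoiding a second full copy of the state dict in host memory.
unet = load_checkpoint_and_dispatch(
    unet,
    checkpoint=f"{unet_dir}/diffusion_pytorch_model.bin",
    device_map="auto",
)
```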

Describe alternatives you’ve considered

  • Loading the models with Accelerate myself, but that requires model surgery and a fair amount of custom code, which makes mistakes more likely
  • Reimplementing what Accelerate does by hand

Additional context
Using Accelerate to natively load models would also reduce the RAM footprint when loading pipelines composed of more than one model, by allocating each model directly on the proper device, again enabling us to deploy Diffusers on cheaper servers.
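A hypothetical example of what the requested native interface could look like; the low_cpu_mem_usage flag name is borrowed from transformers and is an assumption here, not a diffusers argument that existed at the time of the issue:

```python
# Hypothetical desired API: a single flag that makes from_pretrained use
# Accelerate's meta-device loading under the hood for every model in the pipeline.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    low_cpu_mem_usage=True,  # assumed flag name, mirroring transformers
).to("cuda")
```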

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, Sep 2, 2022

Happy to include accelerate! Maybe we can follow the same logic we have in transformers with device_map="auto" 😃 And definitely +1 to making use of torch’s meta device to halve the peak required memory usage!
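For context, this is the transformers behaviour being referenced; both arguments below are existing transformers from_pretrained options (they require accelerate to be installed):

```python
# transformers already exposes Accelerate-backed loading.
from transformers import AutoModelForCausalLM

# device_map="auto" builds the model on the meta device and places weights
# across the available devices as the checkpoint is read.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")

# low_cpu_mem_usage=True keeps peak host RAM close to one copy of the weights.
model = AutoModelForCausalLM.from_pretrained("gpt2", low_cpu_mem_usage=True)
```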

2 reactions
anton-l commented, Aug 31, 2022

Hey @piEsposito, happy to look into that in the coming week! If you have a code snippet that works for you, feel free to include it here as a reference 🤗

@muellerzr @patrickvonplaten wdyt? We can either do this, or bring the low-RAM loading code from transformers
