Running Stable Diffusion in FastAPI Does Not Release GPU Memory
Describe the bug
I am running Stable Diffusion as a web service using FastAPI. It runs fine, but after multiple inference calls the GPU memory fills up and inference fails. It looks as if the memory is not released after each inference.
Reproduction
requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu116
diffusers==0.3.0
fastapi==0.80.0
pydantic==1.9.2
torch==1.12.1
transformers==4.21.2
uvicorn==0.18.3
main.py
import logging
import os
import random
import time
import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI, HTTPException, Request
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from typing import List, Optional
# Load default logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
log = logging.getLogger(__name__)
# Load Stable Diffusion model
log.info('Load Stable Diffusion model')
model_path = './models/stable-diffusion-v1-4'
pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    revision='fp16',
    torch_dtype=torch.float16
)
# Move pipeline to GPU for faster inference
pipe = pipe.to('cuda')
pipe.enable_attention_slicing()
# Declare inputs and outputs data types for the API endpoint
class Payload(BaseModel):
    prompt: str # String of text used to generate the images
    num_images = 1 # Number of images to be generated
    height = 512 # Height of the images to be generated
    width = 512 # Width of the images to be generated
    seed: Optional[int] = None # Random integer used as a seed to guide the image generator
    num_steps = 40 # Number of inference steps, results are better the more steps you use, at a cost of slower inference
    guidance_scale = 8.5 # Forces generation to better match the prompt, 7 or 8.5 give good results, results are better the larger the number is, but will be less diverse
class Response(BaseModel):
    images: List[str]
    nsfw_content_detected: List[bool]
    prompt: str
    num_images: int
    height: int
    width: int
    seed: int
    num_steps: int
    guidance_scale: float
# Create FastAPI app
log.info('Start API')
app = FastAPI(title='Stable Diffusion')
app.mount("/static", StaticFiles(directory="./static"), name="static") # Mount folder to expose generated images
# Declare imagine endpoint for inference
@app.post('/imagine', response_model=Response, description='Runs inferences with Stable Diffusion.')
def imagine(payload: Payload, request: Request):
    """The imagine function generates the /imagine endpoint and runs inferences"""
    try:
        # Check payload
        log.info(f'Payload: {payload}')
        # Default seed with a random integer if it is not provided by user
        if payload.seed is None:
            payload.seed = random.randint(-999999999, 999999999)
        generator = torch.Generator('cuda').manual_seed(payload.seed)
        # Create multiple prompts according to the number of images
        prompt = [payload.prompt] * payload.num_images
        # Run inference on GPU
        log.info('Run inference')
        with torch.autocast('cuda'):
            result = pipe(
                prompt=prompt,
                height=payload.height,
                width=payload.width,
                num_inference_steps=payload.num_steps,
                guidance_scale=payload.guidance_scale,
                generator=generator
            )
        log.info('Inference completed')
        # Save images
        images_urls = []
        for image in result.images:
            image_name = str(time.time()).replace('.', '') + '.png'
            image_path = os.path.join('static', image_name)
            image.save(image_path)
            image_url = request.url_for('static', path=image_name)
            images_urls.append(image_url)
        # Build response object
        response = {}
        response['images'] = images_urls
        response['nsfw_content_detected'] = result['nsfw_content_detected']
        response['prompt'] = payload.prompt
        response['num_images'] = payload.num_images
        response['height'] = payload.height
        response['width'] = payload.width
        response['seed'] = payload.seed
        response['num_steps'] = payload.num_steps
        response['guidance_scale'] = payload.guidance_scale
        return response
    except Exception as e:
        log.error(repr(e))
        raise HTTPException(status_code=500, detail=repr(e))
Command to run the FastAPI: uvicorn main:app --host 0.0.0.0 --port 5000
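To check whether memory really grows from one request to the next, it can help to log PyTorch's own counters before and after each inference. The following is a minimal sketch, not part of the original report: the helper names gpu_memory_mib and log_gpu_memory are hypothetical, and the import is guarded only so that the snippet also runs on a machine without PyTorch or CUDA. It relies on torch.cuda.memory_allocated() (memory held by live tensors) and torch.cuda.memory_reserved() (memory kept by PyTorch's caching allocator, which is closer to what nvidia-smi reports).

```python
import logging

try:
    import torch
except ImportError:  # assumption: allow the sketch to run without PyTorch installed
    torch = None

log = logging.getLogger(__name__)


def gpu_memory_mib():
    """Return (allocated, reserved) GPU memory in MiB, or (0.0, 0.0) without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mib  # held by live tensors
    reserved = torch.cuda.memory_reserved() / mib    # held by the caching allocator
    return allocated, reserved


def log_gpu_memory(tag):
    """Log the current counters with a caller-chosen tag and return them."""
    allocated, reserved = gpu_memory_mib()
    log.info('%s: allocated=%.1f MiB, reserved=%.1f MiB', tag, allocated, reserved)
    return allocated, reserved
```

Calling log_gpu_memory('before inference') and log_gpu_memory('after inference') around the pipe(...) call in the endpoint would show whether the allocated figure keeps rising (tensors are still referenced somewhere) or only the reserved figure does (the allocator is caching freed blocks).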
Logs
No response
System Info
- diffusers version: 0.3.0
- Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.12.1+cu116 (True)
- Huggingface_hub version: 0.10.1
- Transformers version: 4.21.2
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Issue Analytics
- State:
- Created a year ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
There is something similar: https://github.com/huggingface/accelerate/issues/614#issuecomment-1224213502 (I tried it locally; that issue's author was using Flask).
As this issue's author already found a solution using torch.cuda.empty_cache(), similar to what I have done above, and also an extra torch.cuda.ipc_collect() (I should look into this, thanks), I think everything is fine. Here are some comments:
- [...] torch.cuda.empty_cache() in the call method(s). We can do it in a few places (like when loading a model), but not in a call method.
- [...] torch, and when you see it increase, it doesn't always mean there will be an issue. Within the Python process where torch resides, torch will do something when new GPU memory needs to be allocated. The problem occurs when other processes want to use the GPU, or when other frameworks (like TensorFlow) in the same process need the GPU.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
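The workaround mentioned in the comments, torch.cuda.empty_cache() together with torch.cuda.ipc_collect(), could be applied at the application level roughly as sketched below. This is an illustration under stated assumptions, not the code from the linked issue: release_gpu_memory and run_with_cleanup are hypothetical helper names, gc.collect() is an extra step added here so that unreachable tensors are dropped before the cache is emptied, and the guarded import exists only so the snippet runs without PyTorch or CUDA.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to run without PyTorch installed
    torch = None


def release_gpu_memory():
    """Collect garbage, then hand cached CUDA blocks back to the driver.

    Returns True if the CUDA calls were made, False if CUDA is unavailable.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()   # release unused cached blocks
        torch.cuda.ipc_collect()   # clean up CUDA IPC memory that is no longer used
        return True
    return False


def run_with_cleanup(fn, *args, **kwargs):
    """Call fn and always release GPU memory afterwards, even if fn raises."""
    try:
        return fn(*args, **kwargs)
    finally:
        release_gpu_memory()
```

In the service above, the body of the imagine endpoint could be wrapped this way so that cleanup happens in the application after each request rather than inside the pipeline's call method. empty_cache() can only free blocks whose tensors are no longer referenced, so the pipeline output should be out of scope (or deleted) by the time the cleanup runs; the response dictionary holds only URLs and plain values, not GPU tensors.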