Running Stable Diffusion in FastAPI Does Not Release GPU Memory

Describe the bug

I am running Stable Diffusion as a web service using FastAPI. It runs fine, but after multiple inference calls the GPU memory fills up and inference fails. It looks as if the memory is not released after each inference.

Reproduction

requirements.txt

--extra-index-url https://download.pytorch.org/whl/cu116
diffusers==0.3.0
fastapi==0.80.0
pydantic==1.9.2
torch==1.12.1
transformers==4.21.2
uvicorn==0.18.3

main.py

import logging
import os
import random
import time
import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI, HTTPException, Request
from fastapi.staticfiles import StaticFiles
from pydantic import BaseModel
from typing import List, Optional


# Load default logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
log = logging.getLogger(__name__)

# Load Stable Diffusion model
log.info('Load Stable Diffusion model')
model_path = './models/stable-diffusion-v1-4'
pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    revision='fp16',
    torch_dtype=torch.float16
)

# Move pipeline to GPU for faster inference
pipe = pipe.to('cuda')
pipe.enable_attention_slicing()

# Declare inputs and outputs data types for the API endpoint
class Payload(BaseModel):
    prompt: str                 # String of text used to generate the images
    num_images = 1              # Number of images to be generated
    height = 512                # Height of the images to be generated
    width = 512                 # Width of the images to be generated
    seed: Optional[int] = None  # Random integer used as a seed to guide the image generator
    num_steps = 40              # Number of inference steps, results are better the more steps you use, at a cost of slower inference
    guidance_scale = 8.5        # Forces generation to better match the prompt, 7 or 8.5 give good results, results are better the larger the number is, but will be less diverse

class Response(BaseModel):
    images: List[str]
    nsfw_content_detected: List[bool]
    prompt: str
    num_images: int
    height: int
    width: int
    seed: int
    num_steps: int
    guidance_scale: float

# Create FastAPI app
log.info('Start API')
app = FastAPI(title='Stable Diffusion')
app.mount("/static", StaticFiles(directory="./static"), name="static") # Mount folder to expose generated images

# Declare imagine endpoint for inference
@app.post('/imagine', response_model=Response, description='Runs inferences with Stable Diffusion.')
def imagine(payload: Payload, request: Request):
    """The imagine function generates the /imagine endpoint and runs inferences"""

    try:
        # Check payload
        log.info(f'Payload: {payload}')

        # Default seed with a random integer if it is not provided by user
        if payload.seed is None:
            payload.seed = random.randint(-999999999, 999999999)
        generator = torch.Generator('cuda').manual_seed(payload.seed)

        # Create multiple prompts according to the number of images
        prompt = [payload.prompt] * payload.num_images

        # Run inference on GPU
        log.info('Run inference')
        with torch.autocast('cuda'):
            result = pipe(
                prompt=prompt,
                height=payload.height,
                width=payload.width,
                num_inference_steps=payload.num_steps,
                guidance_scale=payload.guidance_scale,
                generator=generator
            )
        log.info('Inference completed')

        # Save images
        images_urls = []
        for image in result.images:
            image_name = str(time.time()).replace('.', '') + '.png'
            image_path = os.path.join('static', image_name)
            image.save(image_path)
            image_url = request.url_for('static', path=image_name)
            images_urls.append(image_url)

        # Build response object
        response = {}
        response['images'] = images_urls
        response['nsfw_content_detected'] = result['nsfw_content_detected']
        response['prompt'] = payload.prompt
        response['num_images'] = payload.num_images
        response['height'] = payload.height
        response['width'] = payload.width
        response['seed'] = payload.seed
        response['num_steps'] = payload.num_steps
        response['guidance_scale'] = payload.guidance_scale

        return response

    except Exception as e:
        log.error(repr(e))
        raise HTTPException(status_code=500, detail=repr(e))

Command to run the FastAPI app: uvicorn main:app --host 0.0.0.0 --port 5000

Logs

No response

System Info

diffusers version: 0.3.0
Platform: Linux-5.4.72-microsoft-standard-WSL2-x86_64-with-glibc2.29
Python version: 3.8.10
PyTorch version (GPU?): 1.12.1+cu116 (True)
Huggingface_hub version: 0.10.1
Transformers version: 4.21.2
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Issue Analytics

Created a year ago
Comments: 11 (7 by maintainers)

Top GitHub Comments

ydshieh commented, Oct 20, 2022

There is a similar report:

https://github.com/huggingface/accelerate/issues/614#issuecomment-1224213502 (I tried it locally; that issue's author was using Flask)

Since the author of this issue has already found a solution using torch.cuda.empty_cache(), similar to what I did above, plus an extra torch.cuda.ipc_collect() (I should look into this, thanks), I think everything is fine.
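
The workaround mentioned here can be sketched as a small helper. This is only a minimal sketch, not the exact fix applied in the original issue: release_gpu_memory and gpu_cleanup are hypothetical names, and the only PyTorch calls used are torch.cuda.is_available(), torch.cuda.empty_cache(), and torch.cuda.ipc_collect(). Note that empty_cache() can only return cached blocks that no live tensor still references, so references to GPU tensors should be dropped before calling it.

```python
import gc
from contextlib import contextmanager

try:
    import torch
except ImportError:  # allows the sketch to be imported on a machine without PyTorch
    torch = None


def release_gpu_memory() -> bool:
    """Collect unreachable Python objects, then release torch's unused CUDA cache.

    Hypothetical helper. Returns True if the CUDA cache was cleared,
    False if PyTorch or a CUDA device is not available.
    """
    # Drop unreachable objects first so their GPU tensors become free blocks.
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return False
    # Return unoccupied cached blocks to the driver.
    torch.cuda.empty_cache()
    # Collect memory that was released through CUDA IPC.
    torch.cuda.ipc_collect()
    return True


@contextmanager
def gpu_cleanup():
    """Run release_gpu_memory() after the wrapped block, even if it raises."""
    try:
        yield
    finally:
        release_gpu_memory()
```

In the reproduction above, the inference call could be placed inside `with gpu_cleanup():` (around the `with torch.autocast('cuda'):` block) so the cache is released after every request, including failed ones. The likely trade-off is that the next request has to allocate its memory again instead of reusing the cache.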

Here are some comments:

(It’s always a good idea to try a local version first, to see whether the issue occurs without a web framework.)
On our side, it’s not very reasonable to add torch.cuda.empty_cache() in the call method(s). We can do it in a few places (such as while loading a model), but not in a call method.
Also, the cache is managed by torch, and seeing it grow doesn’t always mean there will be an issue. As long as the Python process that torch resides in is alive, torch will manage this cache itself when new GPU memory needs to be allocated. The problem occurs when other processes want to use the GPU, or when another framework (like TensorFlow) in the same process needs the GPU.
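
To see the distinction drawn in the last comment, it can help to log both the memory occupied by live tensors and the memory held by torch's caching allocator. The following is a minimal sketch under that assumption; gpu_memory_report is a hypothetical helper name, and it relies only on torch.cuda.memory_allocated() and torch.cuda.memory_reserved().

```python
try:
    import torch
except ImportError:  # allows the sketch to be imported on a machine without PyTorch
    torch = None


def gpu_memory_report(device: int = 0) -> dict:
    """Return allocated and reserved CUDA memory in MiB (hypothetical helper).

    allocated_mib: memory occupied by live tensors.
    reserved_mib:  memory held by torch's caching allocator (live tensors plus cache).
    Both values are 0.0 if PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return {"allocated_mib": 0.0, "reserved_mib": 0.0}
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
    }
```

Logging this report before and after each request would show whether the allocated figure returns to its baseline. If it does while the reserved figure stays high, the growth is probably cache rather than a leak; if the allocated figure keeps climbing, some reference to GPU tensors is likely being kept alive.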

github-actions[bot] commented, Nov 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.