
[BUG] Gunicorn Workers Hangs And Consumes Memory Forever

See original GitHub issue

Describe the bug

I have deployed a FastAPI app that queries the database and returns the results, and I made sure to close the DB connection after each query. I’m running gunicorn with this command:

gunicorn -w 8 -k uvicorn.workers.UvicornH11Worker -b 0.0.0.0 app:app --timeout 10

After exposing it to the web, I ran a load test that makes 30-40 parallel requests against the FastAPI app, and that’s where the problem starts. Watching htop in the meantime, I see that RAM usage keeps growing; it seems no task is killed after completing its job. The task count behaves the same way: the gunicorn workers do not appear to get killed. After some time RAM usage reaches its maximum and the app starts to throw errors. So I killed the gunicorn app, but the processes spawned by the main gunicorn process did not get killed and were still using all the memory.
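The report does not include the application code; a minimal sketch of the kind of app described (a plain def endpoint that queries a database, with a hypothetical get_rows() helper standing in for the real query) could look like this:

# app.py -- hypothetical sketch of the setup described above; get_rows() is a
# stand-in for the real database query and is not from the original report.
from fastapi import FastAPI

app = FastAPI()

def get_rows():
    # open a connection, run the query, close the connection
    return [{"id": 1}]

@app.get("/items")
def read_items():
    # a plain `def` endpoint is dispatched to a thread pool, as discussed in the comments below
    return get_rows()

Run with the same command as above and load test it with 30-40 parallel requests to see the behaviour described.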

Environment:

  • OS: Ubuntu 18.04

  • FastAPI version: 0.38.1

  • Python version: 3.7.4

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 63 (16 by maintainers)

Top GitHub Comments

31 reactions
ZackJiang21 commented, Aug 26, 2021

Hi everyone,

I just read the source code of FastAPI and tested it myself. First of all, this should not be a memory leak; the problem is that if your machine has a lot of CPUs, it will occupy a lot of memory.

The only difference is in starlette.routing.py, in the method request_response():

async def run_endpoint_function(
    *, dependant: Dependant, values: Dict[str, Any], is_coroutine: bool
) -> Any:
    # Only called by get_request_handler. Has been split into its own function to
    # facilitate profiling endpoints, since inner functions are harder to profile.
    assert dependant.call is not None, "dependant.call must be a function"

    if is_coroutine:
        return await dependant.call(**values)
    else:
        return await run_in_threadpool(dependant.call, **values)
  
 
async def run_in_threadpool(
    func: typing.Callable[..., T], *args: typing.Any, **kwargs: typing.Any
) -> T:
    loop = asyncio.get_event_loop()
    if contextvars is not None:  # pragma: no cover
        # Ensure we run in the same context
        child = functools.partial(func, *args, **kwargs)
        context = contextvars.copy_context()
        func = context.run
        args = (child,)
    elif kwargs:  # pragma: no cover
        # loop.run_in_executor doesn't accept 'kwargs', so bind them in here
        func = functools.partial(func, **kwargs)
    return await loop.run_in_executor(None, func, *args)

If your REST interface is not async, it runs through loop.run_in_executor, but Starlette does not specify the executor here, so the default thread pool size is os.cpu_count() * 5. My test machine has 40 CPUs, so I get 200 threads in the pool. After each request the objects held by these threads are not released unless the thread is reused by a later request, which occupies a lot of memory; but in the end it’s not a memory leak.
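For reference, the default size implied here can be checked directly; on Python 3.7 and earlier, ThreadPoolExecutor defaults to os.cpu_count() * 5 workers when max_workers is None:

import os

# default ThreadPoolExecutor size on Python <= 3.7 when max_workers is None
print((os.cpu_count() or 1) * 5)  # 200 on the 40-CPU machine described above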

Below is my test code if you want to reproduce it:

import asyncio

import cv2 as cv
import gc
from pympler import tracker
from concurrent import futures

# you can change worker number here
executor = futures.ThreadPoolExecutor(max_workers=1)

memory_tracker = tracker.SummaryTracker()

def mm():
    img = cv.imread("cap.jpg", 0)
    detector = cv.AKAZE_create()
    kpts, desc = detector.detectAndCompute(img, None)
    gc.collect()
    memory_tracker.print_diff()
    return None

async def main():
    while True:
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(executor, mm)


if __name__=='__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Even though it’s not a memory leak, I still think it’s not a good implementation, because it’s sensitive to your CPU count, and when you run a large deep learning model in FastAPI you will find it occupies a ton of memory. So I suggest we make the thread pool size configurable.
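The pool size was not configurable through FastAPI or Starlette at the time. One possible workaround, sketched here under the assumption that your Starlette version dispatches sync endpoints through loop.run_in_executor(None, ...) as quoted above (newer releases go through anyio instead, where this has no effect), is to swap in a smaller default executor at startup:

import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def limit_default_threadpool() -> None:
    # run_in_executor(None, ...) falls back to the loop's default executor, so
    # replacing it caps the number of worker threads and the memory they keep
    # alive between requests. max_workers=8 is an arbitrary example value.
    asyncio.get_event_loop().set_default_executor(ThreadPoolExecutor(max_workers=8))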

If you are interested in how I read through the source code, please refer to my blog and give it a like (https://www.jianshu.com/p/e4595c48d091).

Sorry for only writing blogs in Chinese 😃

Current Solution

  1. Python 3.9 already limits the number of threads in the default thread pool, as shown below:

         if max_workers is None:
             # ThreadPoolExecutor is often used to:
             # * CPU bound task which releases GIL
             # * I/O bound task (which releases GIL, of course)
             #
             # We use cpu_count + 4 for both types of tasks.
             # But we limit it to 32 to avoid consuming surprisingly large resource
             # on many core machine.
             max_workers = min(32, (os.cpu_count() or 1) + 4)
         if max_workers <= 0:
             raise ValueError("max_workers must be greater than 0")
    

     If 32 threads is not too large for your program, you can upgrade to Python 3.9 to avoid this issue.

  2. Define your interface with async so that requests run on the event loop, though throughput may be affected (see the sketch after this list).
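A minimal sketch of option 2: declaring the endpoint with async def keeps the request on the event loop instead of handing it to the thread pool, at the cost that any blocking call inside it will stall other requests.

from fastapi import FastAPI

app = FastAPI()

@app.get("/items")
async def read_items():
    # awaited directly on the event loop; no thread-pool worker is involved,
    # so blocking work here must be avoided or offloaded explicitly
    return {"status": "ok"}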

11 reactions
kevchentw commented, May 28, 2020

Some statistics for Python 3.7, Python 3.8, and async endpoints:

Initial Mem Usage
==========================================
fastapi-py37: 76.21MiB / 7.353GiB
fastapi-py38: 75.86MiB / 7.353GiB
fastapi-py37-async: 75.44MiB / 7.353GiB
fastapi-py38-async: 75.62MiB / 7.353GiB
==========================================
Run 1000 Requests....
==========================================
Run fastapi-py37
real: 0m16.632s; user 0m4.748s; system 0m2.855s
Run fastapi-py38
real: 0m15.319s; user 0m4.750s; system 0m2.722s
Run fastapi-py37-async
real: 0m21.276s; user 0m4.877s; system 0m2.823s
Run fastapi-py38-async
real: 0m22.568s; user 0m5.218s; system 0m2.935s
==========================================
After 1000 Requests Mem Usage
==========================================
fastapi-py37: 1.266GiB / 7.353GiB
fastapi-py38: 144.8MiB / 7.353GiB
fastapi-py37-async: 84.07MiB / 7.353GiB
fastapi-py38-async: 83.63MiB / 7.353GiB
==========================================

Top Results From Across the Web

Gunicorn worker doesn't deflate memory after request
I have a single gunicorn worker process running to read an enormous excel file which takes up to 5 minutes and uses 4GB...
Read more >
Does Gunicorn Thread Take More Ram Memory - ADocLib
It's a prefork worker model ported from Ruby's Unicorn project. The Gunicorn server is. [BUG] Gunicorn Workers Hangs And Consumes Memory Forever. fastapi....
Read more >
Increased memory usage of pulp-3 workers during repo sync
Cause: During repository sync, pulp-3 workers exhibit evident higher memory usage when compared to pulp-2 workers. Consequence: Satellite hits OOM or heavy ...
Read more >
Deployments Concepts - FastAPI
When building web APIs with FastAPI, if there's an error in our code, ... And if you start 4 processes (4 workers), each...
Read more >
Saving Memory on a Python Database Server
Gunicorn uses a pre-fork worker model, which means that it manages a set of worker threads that handle requests concurrently. Gunicorn allows ...
Read more >
