Load Test: Low performance on Kubernetes

See original GitHub issue

Hello everyone!

I am new to uvicorn, so I apologize if this is common knowledge.

After serving machine learning models with Flask + waitress and handling only about 2 requests per second, we decided to move to FastAPI and use uvicorn/gunicorn. After some development time, our application could handle 1000 requests per second locally. Load tests are done with Gatling (https://github.com/gatling/gatling). However, this only holds for short test runs. Sending a .json file 1000 times a second over a period of 60 seconds results in a lot of closed connections (timeouts). Increasing the timeout parameters solves this, but I am not sure it is the recommended way to do it. If you have any advice here, we would be happy to hear it.

Our standard load-test scenario sends 50 requests per second over a period of 60 seconds. Our application needs about 200 ms to process the .json file and respond with the predictions made by the ML model. So in total we have 3000 requests (50 requests per second x 60 seconds), and the application needs about 10 minutes to process them all. This works without a problem once the gunicorn timeout is increased, at least when the application runs in a Docker container locally on our machine. Uvicorn on its own, for example, doesn’t need any additional timeout to work properly.
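
The arithmetic in this scenario can be sketched as a rough estimate. The function name `load_estimate` and its `workers` parameter are hypothetical, and the sketch assumes each worker processes one request at a time, which the issue does not state.

```python
import math


def load_estimate(arrival_rate_per_s, duration_s, service_time_s, workers=1):
    """Rough capacity estimate for a fixed-rate load test.

    Assumption (not from the issue): every worker handles requests
    strictly one after another, each taking service_time_s.
    """
    total_requests = arrival_rate_per_s * duration_s
    # Time to work through all requests when `workers` process them in parallel.
    drain_time_s = total_requests * service_time_s / workers
    # Concurrency needed so that work does not pile up (arrival rate x service time).
    workers_to_keep_up = math.ceil(round(arrival_rate_per_s * service_time_s, 9))
    return {
        "total_requests": total_requests,
        "drain_time_s": drain_time_s,
        "workers_to_keep_up": workers_to_keep_up,
    }


if __name__ == "__main__":
    # 50 requests/s for 60 s at about 200 ms per request, as described above.
    print(load_estimate(50, 60, 0.2))
```

With the figures above, this gives 3000 requests and 600 seconds (10 minutes) of processing time for a single sequential worker. Under the same assumption, roughly 10 requests would have to be in flight at once to keep pace with the arrival rate.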

The first tests were done locally in a Docker container. The Docker image is based on miniconda3 (https://hub.docker.com/r/continuumio/miniconda3), which in turn is built on Debian. We have tested serving the application with uvicorn and with gunicorn using uvicorn workers:

uvicorn predict:app --backlog 8196 --host 0.0.0.0 --port 8099

gunicorn -b 0.0.0.0:8099 -k uvicorn.workers.UvicornWorker predict:app --backlog 8196 --timeout 900 --graceful-timeout 900 --keep-alive 900
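
For reference, the gunicorn command above can also be written as a Python config file. This is only a sketch: the file name `gunicorn.conf.py` and the `GUNICORN_WORKERS` environment variable are assumptions, and the `workers` setting is not part of the original command, which leaves gunicorn at its default of a single worker process.

```python
# gunicorn.conf.py (hypothetical file name)
import os

# Settings mirrored from the command line shown above.
bind = "0.0.0.0:8099"
worker_class = "uvicorn.workers.UvicornWorker"
backlog = 8196
timeout = 900
graceful_timeout = 900
keepalive = 900

# Assumption, not in the original command: an explicit worker count.
# GUNICORN_WORKERS is a made-up variable name; 2 is a placeholder value
# that would need tuning against the CPU limit of the container.
workers = int(os.environ.get("GUNICORN_WORKERS", "2"))
```

Such a file would be loaded with `gunicorn -c gunicorn.conf.py predict:app`. Whether more workers help depends on the CPU and memory limits of the container, which the issue does not list.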

As mentioned, this works fine. The only thing we noticed is that Gatling doesn’t update its response counts every second, so while watching the load test you might think the application responds in batches. However, the logs of the Docker container show that it is responding the whole time.

We have now deployed the application to Kubernetes using the gunicorn command, and the performance is bad. It handles only about one request per second, and the container gets restarted. In a Kubernetes container the application needs about 600 ms to process a request. Using the same load-test scenario as above, it is only able to respond to about 60 requests out of 3000. The remaining requests quickly fail with HTTP server errors (502, 503, 504).

Gatling report on application, if run in Kubernetes:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                       3001 (OK=68     KO=2933  )
> min response time                                      7 (OK=52     KO=7     )
> max response time                                  55791 (OK=40113  KO=55791 )
> mean response time                                  8766 (OK=37806  KO=8093  )
> std deviation                                      12966 (OK=7942   KO=12270 )
> response time 50th percentile                          9 (OK=39486  KO=9     )
> response time 75th percentile                      15031 (OK=39822  KO=15030 )
> response time 95th percentile                      39747 (OK=40055  KO=15050 )
> response time 99th percentile                      55144 (OK=40104  KO=55159 )
> mean requests/sec                                 48.403 (OK=1.097  KO=47.306)
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                             1 (  0%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                           67 (  2%)
> failed                                              2933 ( 98%)
---- Errors --------------------------------------------------------------------
> status.find.is(200), but actually found 503                      1693 (57,72%)
> status.find.is(200), but actually found 504                      1111 (37,88%)
> status.find.is(200), but actually found 502                       129 ( 4,40%)
================================================================================
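
As a sanity check, the headline figures of this report can be recomputed from its counts. This is a minimal sketch that assumes Gatling's mean requests/sec is the request count divided by the run duration; the name `summarize` is hypothetical.

```python
# Figures copied from the Gatling report above.
TOTAL, OK, KO = 3001, 68, 2933
MEAN_RPS = 48.403
ERRORS = {503: 1693, 504: 1111, 502: 129}


def summarize(total, ok, ko, mean_rps, errors):
    # Assumption: mean requests/sec = total requests / run duration.
    duration_s = total / mean_rps
    return {
        "duration_s": round(duration_s, 1),
        "success_pct": round(100 * ok / total, 2),
        "ok_rps": round(ok / duration_s, 3),
        "error_share_pct": {
            code: round(100 * count / ko, 2) for code, count in errors.items()
        },
    }


summary = summarize(TOTAL, OK, KO, MEAN_RPS, ERRORS)
print(summary)
```

Under that assumption the run lasted about 62 seconds, about 2.27% of requests succeeded, and the successful throughput of about 1.097 requests per second matches the report.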

Snippet of the application endpoint:

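# Imports assumed for this snippet; the issue does not show them.
# JWTError suggests python-jose, but that is a guess.
from fastapi import Depends, HTTPException, status
from jose import JWTError, jwt


async def verify_client(token: str):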
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM], audience=AUDIENCE)
    except JWTError:
        raise credentials_exception


@app.post("/score", response_model=cluster_api_models.Response_Model)
async def score(request: cluster_api_models.Request_Model, token: str = Depends(oauth2_scheme)):
    logger.info("Token: {0}".format(token))
    await verify_client(token)
    result = await do_score(request)
    return result
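
The issue does not show what `do_score` does internally. If it runs blocking work directly inside the coroutine (an assumption), the event loop cannot serve other requests while it runs. The sketch below uses only the standard library to illustrate that effect and one common mitigation, handing the call to a thread pool. All names are hypothetical, and `time.sleep` stands in for model inference.

```python
import asyncio
import time


def blocking_score(payload):
    # Stand-in for model inference; time.sleep blocks the calling thread.
    time.sleep(0.1)
    return {"input": payload, "prediction": 1}


async def score_blocking(payload):
    # Runs the blocking call on the event loop thread, so nothing else
    # can be scheduled until it returns.
    return blocking_score(payload)


async def score_offloaded(payload):
    # Hands the blocking call to the default thread pool executor and
    # awaits the result, leaving the event loop free in the meantime.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_score, payload)


async def timed(handler, n):
    # Runs n handler calls concurrently and returns (elapsed seconds, results).
    start = time.perf_counter()
    results = await asyncio.gather(*(handler(i) for i in range(n)))
    return time.perf_counter() - start, results


def compare(n=5):
    blocking_s, _ = asyncio.run(timed(score_blocking, n))
    offloaded_s, _ = asyncio.run(timed(score_offloaded, n))
    return blocking_s, offloaded_s


if __name__ == "__main__":
    blocking_s, offloaded_s = compare()
    print(f"blocking: {blocking_s:.2f}s, offloaded: {offloaded_s:.2f}s")
```

In this toy setup the blocking variant takes roughly n times the single-call time, while the offloaded variant overlaps the calls. Real inference code may hold the Python GIL, in which case threads help less and additional worker processes would be the closer fit.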

We have searched for resources on how to speed up our application by changing its configuration. One resource we will be trying out shortly is this article: https://pythonspeed.com/articles/gunicorn-in-docker/ However, we would be grateful for any advice on how we might do better.

I am sorry for this lengthy text. Hopefully it covers most of the information you need. Let me know if you need more info. Thank you in advance!

Issue Analytics

State:
Created 3 years ago
Comments: 9 (1 by maintainers)

Top GitHub Comments

1 reaction

AFUEU commented, Feb 24, 2021

@makarov-roman I haven’t resolved the issue yet. After some attempts, I decided to focus on other tasks. However, it’s still our goal to increase the performance and improve the scalability of the application.

0 reactions

dacevedo12 commented, Dec 14, 2022

@euri10 is that blocking behavior intended?
