Load Test: Low performance on Kubernetes
Hello everyone!
I am new to uvicorn, so I apologize if this is common knowledge.
After serving machine learning models with Flask + waitress and handling only a very low number of requests per second (~2), we decided to move to FastAPI and use uvicorn/gunicorn. After some development time, our application could handle 1000 requests per second (locally). Load tests are done with Gatling (https://github.com/gatling/gatling). However, this only holds for short tests. Sending a .json file 1000 times a second over a period of 60 seconds results in a lot of closed connections (timeouts). This can be solved by increasing the timeout parameters, but we are not sure if this is the recommended way to do it. If you have some advice here, we would be happy to hear it.
The standard scenario we use for load testing sends 50 requests per second over a period of 60 seconds. Our application needs about 200 ms to process the .json file and responds with the predictions made by the ML model. So in total we have 3000 requests (50 requests x 60 seconds), and the application needs about 10 minutes to process all of them (3000 x 200 ms = 600 s when they are handled one after another). This works without a problem once the timeout for gunicorn is increased, as long as we run the application in a Docker container locally on our machine. Uvicorn on its own, for example, doesn’t need any additional timeout settings to work properly.
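The arithmetic behind these numbers can be checked with a short sketch. It assumes that a single worker processes requests strictly one at a time, which is what the 10-minute estimate implies; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the load-test numbers quoted above.
# Assumption: one worker handles requests strictly one after another.
arrival_rate = 50        # requests sent per second
test_duration_s = 60     # length of the load test in seconds
service_time_s = 0.2     # ~200 ms to score one .json payload

total_requests = arrival_rate * test_duration_s
drain_time_s = total_requests * service_time_s
max_throughput = 1 / service_time_s  # requests/second one serial worker can sustain

print(f"total requests: {total_requests}")
print(f"time to process all: {drain_time_s / 60:.0f} min")
print(f"sustainable throughput: {max_throughput:.0f} req/s")
print(f"queue growth while the test runs: {arrival_rate - max_throughput:.0f} req/s")
```

Under this assumption, requests arrive about ten times faster than they can be served, so the queue keeps growing for the whole test and the last requests wait for most of the 10 minutes. That would be consistent with the long gunicorn timeouts being necessary.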
The first tests were done locally in a Docker container. The Docker image is based on miniconda3 (https://hub.docker.com/r/continuumio/miniconda3), which uses the Debian Linux distribution. We have tested serving the application with uvicorn and with gunicorn using uvicorn workers:
uvicorn predict:app --backlog 8196 --host 0.0.0.0 --port 8099
gunicorn -b 0.0.0.0:8099 -k uvicorn.workers.UvicornWorker predict:app --backlog 8196 --timeout 900 --graceful-timeout 900 --keep-alive 900
As said, this works fine. The only thing we noticed is that Gatling doesn’t update the response counts every second. Watching Gatling run the load test, you might think the application responds in batches. However, the logs of the Docker container show that it is responding the whole time.
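For reference, the gunicorn flags from the local test could also be kept in a gunicorn configuration file, which may make it easier to run exactly the same settings locally and in a cluster. This is a minimal sketch that mirrors only the flags shown above; the file name gunicorn.conf.py and starting the server with "gunicorn -c gunicorn.conf.py predict:app" are assumptions about how it would be wired in.

```python
# gunicorn.conf.py (hypothetical file name), mirroring the command-line flags above.
bind = "0.0.0.0:8099"                            # -b 0.0.0.0:8099
worker_class = "uvicorn.workers.UvicornWorker"   # -k uvicorn.workers.UvicornWorker
backlog = 8196                                   # --backlog 8196
timeout = 900                                    # --timeout 900 (seconds)
graceful_timeout = 900                           # --graceful-timeout 900 (seconds)
keepalive = 900                                  # --keep-alive 900 (seconds)
```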
Now we have deployed the application to Kubernetes using the gunicorn command, and the performance is bad. It handles only about one request per second, and the container gets restarted. In a Kubernetes container the application needs about 600 ms to process a request. Using the same load-test scenario as described above, it is only able to respond to about 60 requests out of 3000. The remaining requests fail very quickly with HTTP server errors (502, 503, 504).
Gatling report for the application when run in Kubernetes:
================================================================================
---- Global Information --------------------------------------------------------
> request count                                     3001 (OK=68     KO=2933  )
> min response time                                    7 (OK=52     KO=7     )
> max response time                                55791 (OK=40113  KO=55791 )
> mean response time                                8766 (OK=37806  KO=8093  )
> std deviation                                    12966 (OK=7942   KO=12270 )
> response time 50th percentile                        9 (OK=39486  KO=9     )
> response time 75th percentile                    15031 (OK=39822  KO=15030 )
> response time 95th percentile                    39747 (OK=40055  KO=15050 )
> response time 99th percentile                    55144 (OK=40104  KO=55159 )
> mean requests/sec                               48.403 (OK=1.097  KO=47.306)
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                           1 (  0%)
> 800 ms < t < 1200 ms                                 0 (  0%)
> t > 1200 ms                                         67 (  2%)
> failed                                            2933 ( 98%)
---- Errors --------------------------------------------------------------------
> status.find.is(200), but actually found 503                      1693 (57,72%)
> status.find.is(200), but actually found 504                      1111 (37,88%)
> status.find.is(200), but actually found 502                       129 ( 4,40%)
================================================================================
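The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below only reuses numbers copied from the report; the variable names are illustrative.

```python
# Figures copied from the Gatling report above.
ok, ko, total = 68, 2933, 3001
errors = {503: 1693, 504: 1111, 502: 129}
mean_rps_total = 48.403

# The OK/KO split and the per-status error counts should add up.
assert ok + ko == total
assert sum(errors.values()) == ko

# Share of each status code among the failed requests, in percent.
error_share = {code: round(100 * n / ko, 2) for code, n in errors.items()}
# Share of successful requests among all requests, in percent.
success_rate = round(100 * ok / total, 2)
# Approximate test duration implied by the mean request rate.
duration_s = total / mean_rps_total

print(error_share)
print(success_rate)
print(round(duration_s))
```

Two things follow from this. The percentages in the Errors section are shares of the failed requests, not of all requests, and only about 2% of all requests succeeded. The implied test duration is roughly 62 seconds, so 68 successful responses correspond to the reported rate of about 1.1 successful requests per second.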
Snippet of the application endpoint:
# Imports are assumed (not shown in the original snippet); JWTError suggests python-jose.
from fastapi import Depends, HTTPException, status
from jose import JWTError, jwt

async def verify_client(token: str):
    # Decode and validate the JWT; respond with 401 if it cannot be validated.
    credentials_exception = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Could not validate credentials",
        headers={"WWW-Authenticate": "Bearer"},
    )
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM], audience=AUDIENCE)
    except JWTError:
        raise credentials_exception

@app.post("/score", response_model=cluster_api_models.Response_Model)
async def score(request: cluster_api_models.Request_Model, token: str = Depends(oauth2_scheme)):
    logger.info("Token: {0}".format(token))
    await verify_client(token)
    result = await do_score(request)
    return result
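The snippet does not show what do_score does internally. If it runs the model synchronously on the event loop (an assumption, not something stated in this post), every request would block all others for the full scoring time, which would match the roughly serial throughput seen in the load tests. The sketch below uses only asyncio and a stand-in function blocking_score (hypothetical, simulated with time.sleep) to compare calling blocking work directly inside a coroutine with handing it to a thread through loop.run_in_executor.

```python
import asyncio
import time

def blocking_score(payload: dict) -> dict:
    # Stand-in for a synchronous model call (hypothetical); sleeps instead of predicting.
    time.sleep(0.05)
    return {"prediction": len(payload)}

async def score_on_loop(payload: dict) -> dict:
    # Blocks the event loop: no other coroutine can run until this returns.
    return blocking_score(payload)

async def score_in_thread(payload: dict) -> dict:
    # Runs the blocking call in the default thread pool, keeping the loop free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_score, payload)

async def timed(handler, n: int = 8) -> float:
    # Fire n concurrent "requests" and measure the wall-clock time for all of them.
    start = time.perf_counter()
    await asyncio.gather(*(handler({"x": i}) for i in range(n)))
    return time.perf_counter() - start

async def main() -> tuple:
    return await timed(score_on_loop), await timed(score_in_thread)

serial_s, threaded_s = asyncio.run(main())
print(f"on event loop: {serial_s:.2f}s, in thread pool: {threaded_s:.2f}s")
```

In this toy setup the thread-pool variant finishes sooner because the simulated calls overlap. Note that time.sleep releases the GIL; a CPU-bound model would only benefit from threads if the underlying library also releases it, and otherwise additional worker processes would be the more likely route.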
We have searched for resources on how to speed up our application by changing the configuration. One resource we will be trying out shortly is this article: https://pythonspeed.com/articles/gunicorn-in-docker/. However, we would be grateful for any advice on how we might do better.
I am sorry for this lengthy text. Hopefully it covers most of the information you need. Let me know if you need more info. Thank you in advance!
Top GitHub Comments
@makarov-roman I haven’t resolved the issue yet. After some attempts, I decided to focus on other tasks. However, it’s still our goal to increase the performance and improve the scalability of the application.
@euri10 is that blocking behavior intended?