Serving ML models with multiple workers linearly increases RAM usage.
Recently, we deployed an ML model with FastAPI and encountered an issue.
The code looks like this:
import json

from fastapi import FastAPI, File, UploadFile
from ocr_pipeline.model.ocr_wrapper import OcrWrapper

api = FastAPI()
# config comes from setup code elided here
ocr_wrapper = OcrWrapper(**config.model_load_params)  # loads a 1.5 GB PyTorch model
...

@api.post('/')
async def predict(file: UploadFile = File(...)):
    preds = ocr_wrapper.predict(file.file, **config.model_predict_params)
    return json.dumps({"data": preds})
Running the app with the following command consumes a minimum of 3 GB of RAM:
gunicorn --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api
Is there any way to scale the number of workers without consuming too much RAM?
Environment: Ubuntu 18.04, Python 3.6.9
fastapi==0.61.2 uvicorn==0.12.2 gunicorn==20.0.4 uvloop==0.14.0
This is not a FastAPI-specific question (it is more of a Gunicorn one); it is about sharing memory between processes.
The solution is to load the model into RAM before Gunicorn forks the workers, so you need to use the --preload option with your main.py file inside the app folder, as shown below.
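As a sketch, the command from the question would become (same workers and worker class, only --preload added):
gunicorn --preload --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api
With --preload, Gunicorn imports app.main (and therefore loads the 1.5 GB model) once in the master process before forking, so the workers share those memory pages via copy-on-write instead of each loading a separate copy.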
If you have more questions about Gunicorn, Python, fork, copy-on-write, Python reference counting, or memory leaks, Stack Overflow is the better place for them.
You can very probably close this issue, thank you 😃
Just found out that if I change my app methods from:
async def predict(file: UploadFile = File(...)):
to:
def predict(file: UploadFile = File(...)):
removing the async qualifier, the model does indeed work as expected. @sevakharutyunyan, are you able to verify if this works for you?
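For context on why dropping async helps: FastAPI runs plain def path operations in a worker thread pool, while the body of an async def runs directly on the event loop, so a blocking predict call inside async def stalls every other request handled by that worker. A minimal, self-contained sketch of the two variants (blocking_predict is a hypothetical stand-in for ocr_wrapper.predict, not part of the original code):
import time

from fastapi import FastAPI

api = FastAPI()

def blocking_predict() -> str:
    # Hypothetical stand-in for ocr_wrapper.predict: any blocking, CPU-bound call.
    time.sleep(2)
    return "prediction"

# async def: the blocking call runs on the event loop, so this worker cannot
# serve any other request until blocking_predict returns.
@api.post('/predict-async')
async def predict_async():
    return {"data": blocking_predict()}

# plain def: FastAPI executes the handler in its thread pool, so the event
# loop stays free to accept and serve other requests in the meantime.
@api.post('/predict-sync')
def predict_sync():
    return {"data": blocking_predict()}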