Support multi-model serving and container sharing
/kind feature
Describe the solution you’d like
Motivation
There are growing use cases for developing per-user or per-category ML models instead of a single cohort model. For example, a news classification service trains a custom model for each news category, and a recommendation model trains on each user's usage history to personalize recommendations. While you get the benefit of better inference accuracy by building a model for each use case, the cost of deploying models increases significantly because you may train anywhere from hundreds to thousands of custom models, and it becomes difficult to manage so many models in production.
Currently KFServing's single-model, single-service design does not scale well when deploying hundreds or thousands of models; deploying that many model services carries significant cost on both cloud and on-prem Kubernetes clusters. While Triton Inference Server already supports a model repository and allows inference for multiple models through the same endpoint, we'd like to extend this to other ML frameworks like sklearn, xgboost, pytorch, etc. for cost-effective container sharing, and to simplify and unify the user experience of deploying multiple models.
Proposal
This is a high-level idea; we will follow up with a more detailed proposal.
- Create a shared multi-model inference service:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "multi-model-sample"
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL
        resources:
          limits:
            cpu: 4
            memory: 10Gi
          requests:
            cpu: 4
            memory: 10Gi
The created pod mounts a config map volume with the multi-model config:
apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {}
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config
The container starts with the following options for TensorFlow; we can add similar options for other model servers:
- --model_config_file=/mnt/models/model_config_list.conf
- --model_config_file_poll_wait_seconds=60
- Deploy news-oil onto the multi-model inference service by adding the annotation below. This step does not actually create a new inference service; rather, the KFServing controller updates the config map of the multi-model inference service with news-oil's storage URI so that the multi-model inference service downloads news-oil into multi-model-sample's service memory, where it becomes available for the user to call from the inference endpoint by model name.
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-oil"
  annotations:
    serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/models/sklearn/news-oil-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
Once this is deployed, the controller adds the following model config to the multi-model-sample inference service's config map:
apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {
      config: {
        name: "news-oil",
        base_path: "gs://kfserving-samples/models/sklearn/news-oil-1",
        model_platform: "sklearn"
      }
    }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config
The model should then be loaded onto the multi-model inference service after the polling period, and the user can curl the endpoint /v1/models/news-oil:predict.
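For example, a request might look like the sketch below; the INGRESS_HOST/INGRESS_PORT variables refer to your Istio ingress gateway and the JSON payload is a placeholder, since both depend on the cluster setup and the model's input features:

# Resolve the shared service's hostname from the InferenceService status URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice multi-model-sample -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Call the loaded model by name on the shared endpoint
curl -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4]]}' \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/news-oil:predict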
- Users can deploy more similar models onto the multi-model inference service. The controller validates that these models use the same ML framework and that their accumulated memory requests stay below the specified multi-model server memory limit (for example, at most ten models requesting 1Gi each fit into the 10Gi multi-model-sample server).
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-sports"
  annotations:
    serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/models/sklearn/news-sports-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
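After news-sports is deployed, the shared config map would presumably hold both model entries, following the same pattern as above:

apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {
      config: {
        name: "news-oil",
        base_path: "gs://kfserving-samples/models/sklearn/news-oil-1",
        model_platform: "sklearn"
      },
      config: {
        name: "news-sports",
        base_path: "gs://kfserving-samples/models/sklearn/news-sports-1",
        model_platform: "sklearn"
      }
    }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config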
- When the user deletes a model, the multi-model inference service unloads it from memory.
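For example, assuming news-oil was registered through its own InferenceService object as above, deleting that object would presumably trigger the unload:

kubectl delete inferenceservice news-oil
# The controller removes the news-oil entry from multi-model-sample-default-config,
# and the model server unloads news-oil at its next config poll.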
In a second phase, KFServing can implement a smarter scheduler that automatically provisions multi-model inference services as users deploy more and more models; a virtual service can be set up to route requests to the right shared service hosting each model, as sketched below.
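The sketch below shows what such routing could look like with an Istio VirtualService; the gateway, namespace, and service names are illustrative assumptions, not part of this proposal:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: news-oil-route
spec:
  hosts:
    - "*"
  gateways:
    - knative-ingress-gateway.knative-serving
  http:
    - match:
        - uri:
            prefix: /v1/models/news-oil
      route:
        - destination:
            # shared multi-model service that currently hosts news-oil
            host: multi-model-sample-predictor-default.default.svc.cluster.local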
Top GitHub Comments
You can check https://github.com/kubeflow/kfserving/blob/master/docs/MULTIMODELSERVING_GUIDE.md#integration-with-model-servers to understand which model servers are integrated with multi-model serving.
Yea, you don't need to redeploy the InferenceService. The KFServing control plane will load models dynamically as long as the model server you use supports loading models dynamically.