Support multi-model serving and container sharing
/kind feature
Describe the solution you’d like
Motivation
There are growing use cases for developing per-user or per-category ML models instead of a single cohort model. For example, a news classification service trains a custom model for each news category, and a recommendation model trains on each user's usage history to personalize recommendations. While you get the benefit of better inference accuracy by building a model for each use case, the cost of deploying models increases significantly because you may train anywhere from hundreds to thousands of custom models, and it becomes difficult to manage so many models in production.
Currently KFServing's single-model, single-service design does not scale well when deploying hundreds or thousands of models; deploying that many model services carries significant cost on both cloud and on-prem Kubernetes clusters. While Triton Inference Server already supports a model repository and allows inference for multiple models through the same endpoint, we'd like to extend this to other ML frameworks like sklearn, xgboost, pytorch, etc. for cost-effective container sharing, and to simplify and unify the user experience of deploying multiple models.
Proposal
This is a high-level idea; we will follow up with a more detailed proposal.
- Create a shared multi-model inference service:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "multi-model-sample"
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL
        resources:
          limits:
            cpu: 4
            memory: 10Gi
          requests:
            cpu: 4
            memory: 10Gi
The created pod mounts a config map volume with the multi-model config:
apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {}
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config
The container starts with the following options for TensorFlow; we can add similar options for other model servers:
- --model_config_file=/mnt/models/model_config_list.conf
- --model_config_file_poll_wait_seconds=60
- Deploy news-oil onto the multi-model inference service by adding the annotation below. This step does not actually create a new inference service; rather, the KFServing controller updates the config map of the multi-model inference service with news-oil's storage URI so that the multi-model inference service downloads news-oil into multi-model-sample's service memory, where it becomes available for the user to call from the inference endpoint by model name.
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-oil"
  annotations:
    serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/models/sklearn/news-oil-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
Once this is deployed, the controller adds the following model config to the multi-model-sample inference service's config map:
apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {
      config: {
        name: "news-oil",
        base_path: "gs://kfserving-samples/models/sklearn/news-oil-1",
        model_platform: "sklearn"
      }
    }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config
The model should then be loaded onto the multi-model inference service after the polling period, and the user can curl the endpoint /v1/models/news-oil:predict.
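For example, a request might look like the sketch below; the INGRESS_HOST/INGRESS_PORT variables refer to your Istio ingress gateway and the JSON payload is a placeholder, since both depend on the cluster setup and the model's input features:

# Resolve the shared service's hostname from the InferenceService status URL
SERVICE_HOSTNAME=$(kubectl get inferenceservice multi-model-sample -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# Call the loaded model by name on the shared endpoint
curl -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[0.1, 0.2, 0.3, 0.4]]}' \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/news-oil:predict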
- Users can deploy more similar models onto the multi-model inference service. The controller validates that these models use the same ML framework and that their accumulated memory requests stay below the specified multi-model server memory limit (for example, at most ten models requesting 1Gi each fit into the 10Gi multi-model-sample server).
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-sports"
  annotations:
    serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/models/sklearn/news-sports-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
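After news-sports is deployed, the shared config map would presumably hold both model entries, following the same pattern as above:

apiVersion: v1
data:
  model_config_list.conf: |-
    model_config_list: {
      config: {
        name: "news-oil",
        base_path: "gs://kfserving-samples/models/sklearn/news-oil-1",
        model_platform: "sklearn"
      },
      config: {
        name: "news-sports",
        base_path: "gs://kfserving-samples/models/sklearn/news-sports-1",
        model_platform: "sklearn"
      }
    }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config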
- When the user deletes a model, the multi-model inference service unloads it from memory.
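For example, assuming news-oil was registered through its own InferenceService object as above, deleting that object would presumably trigger the unload:

kubectl delete inferenceservice news-oil
# The controller removes the news-oil entry from multi-model-sample-default-config,
# and the model server unloads news-oil at its next config poll.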
In a second phase, KFServing can implement a smarter scheduler that automatically provisions multi-model inference services as users deploy more and more models; a virtual service can be set up to route requests to the right shared service hosting each model, as sketched below.
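The sketch below shows what such routing could look like with an Istio VirtualService; the gateway, namespace, and service names are illustrative assumptions, not part of this proposal:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: news-oil-route
spec:
  hosts:
    - "*"
  gateways:
    - knative-ingress-gateway.knative-serving
  http:
    - match:
        - uri:
            prefix: /v1/models/news-oil
      route:
        - destination:
            # shared multi-model service that currently hosts news-oil
            host: multi-model-sample-predictor-default.default.svc.cluster.local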
Top GitHub Comments
You can check https://github.com/kubeflow/kfserving/blob/master/docs/MULTIMODELSERVING_GUIDE.md#integration-with-model-servers to understand which model servers are integrated with multi-model serving.
Yea, you don't need to redeploy the InferenceService. The KFServing control plane will load models dynamically as long as the model server you use supports loading models dynamically.