Support multi-model serving and container sharing

/kind feature

Describe the solution you’d like

Motivation

There are growing use cases for developing per-user or per-category ML models instead of a single cohort model. For example, a news classification service trains a custom model for each news category, and a recommendation service trains on each user's usage history to personalize recommendations. While you get the benefit of better inference accuracy by building a model for each use case, the cost of deploying models increases significantly because you may train anywhere from hundreds to thousands of custom models, and it becomes difficult to manage so many models in production.

Currently, KFServing's single-model, single-service design does not scale well when deploying hundreds or thousands of models; running that many model services carries a significant cost, both on cloud and on on-prem Kubernetes clusters. While the Triton Inference Server already supports a model repository and serves multiple models from the same endpoint, we'd like to extend this to other ML frameworks such as sklearn, XGBoost, and PyTorch for cost-effective container sharing, and to simplify and unify the user experience of deploying multiple models.

Proposal

This is a high-level idea; we will follow up with a more detailed proposal.

  1. Create a shared multi-model inference service:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "multi-model-sample"
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL
        resources:
          limits:
            cpu: 4
            memory: 10Gi
          requests:
            cpu: 4
            memory: 10Gi

The created pod mounts a ConfigMap volume with the multi-model config:

apiVersion: v1
data:
  model_config_list.conf: |-
     model_config_list:{}
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config

The container starts with the following options for TensorFlow; we can add similar options for other model servers:

 - --model_config_file=/mnt/models/model_config_list.conf
 - --model_config_file_poll_wait_seconds=60
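
For illustration, here is a rough sketch of how the generated predictor pod could wire these pieces together (the container name, image tag, and volume name are hypothetical, not the controller's actual output):

spec:
  containers:
    - name: kfserving-container            # hypothetical container name
      image: tensorflow/serving:latest     # hypothetical image/tag
      args:
        - --model_config_file=/mnt/models/model_config_list.conf
        - --model_config_file_poll_wait_seconds=60
      volumeMounts:
        - name: model-config               # hypothetical volume name
          mountPath: /mnt/models
  volumes:
    - name: model-config
      configMap:
        name: multi-model-sample-default-config
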
  2. Deploy news-oil onto the multi-model inference service by adding the annotation. This step does not actually create a new inference service; instead, the KFServing controller updates the ConfigMap of the multi-model inference service with news-oil's storage URI so that the multi-model inference service downloads news-oil into multi-model-sample's memory, where it becomes available for the user to call from the inference endpoint by model name.
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-oil"
  annotations:
      serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/sklearn/news-oil-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi

After this is deployed, the controller adds the following model config to the multi-model-sample inference service's ConfigMap:

apiVersion: v1
data:
  model_config_list.conf: |-
     model_config_list: {
           config: {
               name: "news-oil",
               base_path: "gs://kfserving-samples/models/sklearn/news-oil-1"
               model_platform: "sklearn",
           }
     }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config

The model should then be loaded onto the multi-model inference service after the polling period, and the user can curl the endpoint /v1/models/news-oil:predict.
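
For example, assuming the model server exposes the KFServing V1 prediction protocol and taking a hypothetical two-feature input (the hostname below is a placeholder; the actual URL depends on your ingress and domain setup), the request could look like:

curl -H "Content-Type: application/json" \
  http://multi-model-sample.default.example.com/v1/models/news-oil:predict \
  -d '{"instances": [[1.0, 2.0]]}'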

  3. The user can deploy more similar models onto the multi-model inference service. The controller validates that these models use the same ML framework and that their accumulated memory limit stays below the specified multi-model server memory limit (see the combined config example after this list).
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "news-sports"
  annotations:
      serving.kubeflow.org/multi-model: multi-model-sample
spec:
  default:
    predictor:
      sklearn:
        multiModel: MANUAL  # Defaults to NONE; can also be set to MANUAL or AUTO
        storageUri: gs://kfserving-samples/sklearn/news-sports-1
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
  4. When the user deletes a model, the multi-model inference service unloads it from memory.
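
For illustration, once both news-oil and news-sports are registered against multi-model-sample, the shared ConfigMap would contain one config entry per model (the news-sports base_path below is assumed to be analogous to news-oil's), and the corresponding entry would be removed again when a model's InferenceService is deleted:

apiVersion: v1
data:
  model_config_list.conf: |-
     model_config_list: {
           config: {
               name: "news-oil",
               base_path: "gs://kfserving-samples/models/sklearn/news-oil-1",
               model_platform: "sklearn"
           },
           config: {
               name: "news-sports",
               base_path: "gs://kfserving-samples/models/sklearn/news-sports-1",
               model_platform: "sklearn"
           }
     }
kind: ConfigMap
metadata:
  name: multi-model-sample-default-config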

In a second phase, KFServing can implement a smarter scheduler that automatically provisions multi-model inference services as users deploy more and more models; a virtual service can be set up to route requests to the right shared service hosting each model.
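
As a sketch of how that routing could look, assuming Istio is used as the routing layer (the host names, gateway name, and destination service below are hypothetical, not a settled design), a per-model virtual service might forward requests for a model to the shared service currently hosting it:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: news-oil                      # hypothetical: one route per deployed model
spec:
  hosts:
    - news-oil.default.example.com    # hypothetical external host for the model
  gateways:
    - kfserving-gateway               # hypothetical gateway name
  http:
    - match:
        - uri:
            prefix: /v1/models/news-oil
      route:
        - destination:
            # shared multi-model service that currently hosts this model
            host: multi-model-sample-predictor-default.default.svc.cluster.local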

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 37 (14 by maintainers)

Top GitHub Comments

1 reaction
yuzliu commented, May 10, 2021

@yuzliu Awesome! Thank you for the quick response. Just to confirm, this also allows adding models without redeploying?

Yea, you don't need to redeploy the InferenceService. The KFServing control plane will load models dynamically as long as the model server you use supports dynamic model loading.

You can check https://github.com/kubeflow/kfserving/blob/master/docs/MULTIMODELSERVING_GUIDE.md#integration-with-model-servers to understand which model servers are integrated with multi-model serving.

