
[FR] Make the AzureML entry script accept more data types


Thank you for submitting a feature request. Before proceeding, please review MLflow’s Issue Policy for feature requests and the MLflow Contributing Guide.

Please fill in this feature request template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the MLflow community.
  • No. I cannot contribute this feature at this time.

Proposal Summary

Currently, the entry script required for deploying models to the AzureML ecosystem from MLflow is hard-wired as a text string, configurable only through the mlflow.pyfunc functions for each model flavour. The function it uses to decode JSON input data supports only the Schema types associated with Pandas DataFrame tabular data; it does not, for example, support TF Serving-style tensor structures, which are important for image-based applications.

The string currently hard-coded into the mlflow.azureml.__init__.py file is as follows:

import pandas as pd
from azureml.core.model import Model
from mlflow.pyfunc import load_model
from mlflow.pyfunc.scoring_server import parse_json_input, _get_jsonable_obj

def init():
    # The {model_name} and {model_version} placeholders are substituted
    # when mlflow.azureml renders this template at deployment time.
    global model
    model_path = Model.get_model_path(model_name="{model_name}", version={model_version})
    model = load_model(model_path)

def run(json_input):
    # Only pandas DataFrame JSON (here, in "split" orientation) is handled.
    input_df = parse_json_input(json_input=json_input, orient="split")
    return _get_jsonable_obj(model.predict(input_df), pandas_orient="records")

The parse_json_input function, which comes from the mlflow.pyfunc.scoring_server module, can only handle Pandas DataFrame tabular data, and thus does not convert input JSON content to tensor (numpy ndarray) objects. It does not understand the TF Serving tensor format, which would allow such structures to be decoded.
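
For contrast, the only kind of payload parse_json_input() can decode is pandas "split"-orientation JSON. A minimal sketch (the column names and values here are illustrative, not from the original issue):

import json

import pandas as pd

# A pandas "split"-orientation payload: the only format the current
# entry script can decode. Columns and values are illustrative.
payload = json.dumps({
    "columns": ["sepal_length", "sepal_width"],
    "index": [0, 1],
    "data": [[5.1, 3.5], [4.9, 3.0]],
})

# Equivalent to what parse_json_input(json_input=payload, orient="split")
# returns: a two-row, two-column DataFrame.
df = pd.read_json(payload, orient="split")
print(df)

Any payload that does not fit this tabular shape, such as a serialized image tensor, is rejected by the current entry script.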

Motivation

  • What is the use case for this feature?

A simple extension of the use case here is the ability to accept tensor-type inputs as defined by the TensorSpec Schema type, and to support deserialization of numpy ndarrays such as images. This opens up the potential for any application that uses ndarray objects as inputs, such as CNN applications.

  • Why is this use case valuable to support for MLflow users in general?

It broadens the scope of the types of models that can be deployed on AzureML using the model-agnostic infrastructure developed through the mlflow.pyfunc flavours.

  • Why is this use case valuable to support for your project(s) or organization?

It would support many different types of models, not just those taking Pandas DataFrame tabular data structures as input.

  • Why is it currently difficult to achieve this use case? (please be as specific as possible about why related MLflow features and components are insufficient)

The current implementation of the entry script is defined as a hard-coded string in the mlflow.azureml module and cannot be overridden or modified by any function call. The limitations on what data types the JSON deserialization function can handle therefore cannot be circumvented without changes to the module itself.

What component(s), interfaces, languages, and integrations does this feature affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interfaces

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Languages

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Details

(Use this section to include any additional information about the feature. If you have a proposal for how to implement this feature, please include it here. For implementation guidelines, please refer to the Contributing Guide.)

One potential solution is to reuse functionality already developed for other deployment targets in the mlflow package. The mlflow.pyfunc.scoring_server module contains a number of JSON deserialization functions that can decode TF Serving-style tensor formats and convert them to numpy ndarrays according to the specified mlflow.types.Schema.

If a numpy ndarray representing an image (e.g. an array of shape (3, 800, 600)) is encapsulated and serialized in the following way (using the TF Serving definition):

import json

# `image` is the (3, 800, 600) numpy ndarray described above.
payload = {
    "instances": [
        image.tolist()
    ]
}
payload = str.encode(json.dumps(payload))

Then using the mlflow.pyfunc.scoring_server function infer_and_parse_json_input() in the entry script successfully decodes the JSON-serialized numpy ndarray above back to a correctly sized object, which can then be passed to the loaded model for inference.
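
As a minimal round-trip sketch of that behaviour (a random array stands in for the image, and calling infer_and_parse_json_input() without an explicit schema is an assumption based on its default behaviour):

import json

import numpy as np
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input

# Stand-in for the (3, 800, 600) image described above.
image = np.random.rand(3, 800, 600)

# Serialize using the TF Serving "instances" format.
payload = json.dumps({"instances": [image.tolist()]})

# Decode; the result gains a leading batch dimension because
# "instances" holds a list of inputs.
decoded = infer_and_parse_json_input(payload)
print(decoded.shape)  # expected: (1, 3, 800, 600)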

A notional modification to the entry script could be as simple as:

import pandas as pd
from azureml.core.model import Model
from mlflow.pyfunc import load_model
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input, _get_jsonable_obj

def init():
    global model
    model_path = Model.get_model_path(model_name="{model_name}", version={model_version})
    model = load_model(model_path)

def run(json_input):
    # infer_and_parse_json_input dispatches on the payload shape: DataFrame
    # JSON stays a DataFrame, while TF Serving "instances"/"inputs" payloads
    # become numpy ndarrays. It takes an optional schema rather than an orient.
    input_data = infer_and_parse_json_input(json_input)
    return _get_jsonable_obj(model.predict(input_data), pandas_orient="records")

Local testing of this functionality successfully decoded the JSON-serialized image back to a numpy ndarray, which was accepted by a PyTorch ONNX model that takes numpy ndarray image representations as input.

There are a few issues to iron out, including the additional dimension returned by infer_and_parse_json_input(): a shape of (1, 3, nx, ny) instead of the transmitted (3, nx, ny). The leading dimension is a batch axis: the TF Serving "instances" field holds a list of inputs, so a single image wrapped in a list decodes with a batch size of one. A simple numpy.squeeze() call solves this for single instances; however, the decoded input may not always be a numpy ndarray, so some digging into infer_and_parse_json_input() is required to handle it safely, as sketched below.
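
A hedged sketch of one way run() could normalise the decoded input (the helper name is hypothetical; it assumes the model expects an unbatched (3, nx, ny) array, and it leaves multi-instance batches and non-ndarray inputs such as DataFrames untouched):

import numpy as np

def _strip_batch_axis(decoded):
    # Hypothetical helper: drop the leading batch dimension only when
    # exactly one instance was sent as a numpy ndarray.
    if isinstance(decoded, np.ndarray) and decoded.ndim > 1 and decoded.shape[0] == 1:
        return np.squeeze(decoded, axis=0)
    return decoded

# Example: a (1, 3, 800, 600) decoded batch becomes (3, 800, 600).
batch = np.zeros((1, 3, 800, 600))
print(_strip_batch_axis(batch).shape)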

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
BenWilson2 commented, Feb 3, 2022

Hi @ecm200 we’re getting a few people to take a look at the design implications of this and hope to have an evaluation on the feasibility of this in the next few sprints. Thank you for the idea and I’ll keep you posted on what the team comes back with!

0 reactions
santiagxf commented, Jul 26, 2022

Hi @ecm200! Just a quick update on this thread: the scenario you mentioned is now supported in the latest version of our integration with MLflow. In June we introduced a number of improvements to the integration, along with brand-new documentation and samples.
