
[BUG] Cannot load/serve Spark model with custom transformer


Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.27.0 (also reproduced on 1.26.1)

System information

Describe the problem

I have a custom transformer class (SQLTokenizer) written in PySpark, and I run into issues when trying to load or serve a Spark model pipeline that includes this transformer. I believe I should be using the code_paths parameter of mlflow.spark.log_model() or mlflow.spark.save_model() to save the custom transformer code with the model (https://www.mlflow.org/docs/1.27.0/python_api/mlflow.spark.html#mlflow.spark.log_model). I should note that the API page lists code_paths as a parameter but gives no description of what it does.

When passing in the code_paths parameter, I can see my custom code packaged inside a ‘code’ directory within the model artifact directory (artifact structure shown below).

Subsequently, when I open a new notebook and try to load the model, I get the following error message: “FileNotFoundError: [Errno 2] No such file or directory: ‘/tmp/tmp7shd760i/sparkml/code’”

Side note: is there a Databricks Runtime + MLflow version combination on which this capability is known to work?

Tracking information

No response

Code to reproduce issue

Minimal code that puts a single custom transformer in a Spark pipeline so the resulting model can also be served:

from pyspark.ml import Pipeline
import mlflow

# Define the SQLTokenizer class inline; omitted for space, but it can be added
# if needed (a hypothetical sketch is shown after this block).
# `data` is assumed to be a Spark DataFrame with a `merged_query` column.

sql_tokenizer = SQLTokenizer(inputCol="merged_query", outputCol="prediction")
sql_pipeline = Pipeline(stages=[sql_tokenizer])
sql_model = sql_pipeline.fit(data)
mlflow.spark.log_model(sql_model, 'sql_model', code_paths=['/dbfs/path/to/sql_tokenizer.py'])
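
For reference, here is a minimal sketch of what such a transformer might look like. This is a hypothetical stand-in, not the actual SQLTokenizer: it extends Transformer with the shared input/output column params and the default read/write mixins so Spark can persist it as a pipeline stage.

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F

class SQLTokenizer(Transformer, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable):
    """Hypothetical minimal stand-in for the custom transformer."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        # keyword_only captures the constructor kwargs in _input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        # Toy tokenization: lowercase the input column and split on whitespace
        return dataset.withColumn(
            self.getOutputCol(),
            F.split(F.lower(F.col(self.getInputCol())), r"\s+"),
        )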

Saved model structure from mlflow.spark.log_model()

|-sql_model
    |-code
        |-sql_tokenizer.py
    |-sparkml
        |-metadata
            |-SUCCESS
            |-part-00000
        |-stages
            |-0_SQLTokenizer_182199978fae
                |-metadata
                    |-SUCCESS
                    |-part-00000
    |-MLmodel
    |-conda.yaml
    |-python_env.yaml
    |-requirements.txt
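
The MLmodel file contents weren't captured above, but based on the traceback below, the spark flavor configuration presumably looks roughly like this (a hypothetical reconstruction; version values are illustrative). Note that code is recorded relative to the model root, while model_data points at the sparkml subdirectory:

flavors:
  spark:
    code: code              # saved at the model root
    model_data: sparkml
    pyspark_version: 3.2.1  # illustrative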

Subsequently, try to load the model from another notebook:

import mlflow

model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
custom_layer = mlflow.spark.load_model(model_path)

Stack trace

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-3019178> in <cell line: 5>()
      3 
      4 model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
----> 5 custom_layer = mlflow.spark.load_model(model_path)
      6 
      7 display(custom_layer.transform(merged_data))

/databricks/python/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir)
    707     model_uri = append_to_uri_path(model_uri, flavor_conf["model_data"])
    708     local_model_path = _download_artifact_from_uri(model_uri)
--> 709     _add_code_from_conf_to_system_path(local_model_path, flavor_conf)
    710 
    711     return _load_model(model_uri=model_uri, dfs_tmpdir_base=dfs_tmpdir)

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_from_conf_to_system_path(local_path, conf, code_key)
    153     if code_key in conf and conf[code_key]:
    154         code_path = os.path.join(local_path, conf[code_key])
--> 155         _add_code_to_system_path(code_path)

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_to_system_path(code_path)
    119 
    120 def _add_code_to_system_path(code_path):
--> 121     sys.path = [code_path] + _get_code_dirs(code_path) + sys.path
    122     # Delete cached modules so they will get reloaded anew from the correct code path
    123     # Otherwise python will use the cached modules

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _get_code_dirs(src_code_path, dst_code_path)
     87     return [
     88         (os.path.join(dst_code_path, x))
---> 89         for x in os.listdir(src_code_path)
     90         if os.path.isdir(os.path.join(src_code_path, x)) and not x == "__pycache__"
     91     ]

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp7shd760i/sparkml/code'

Inspection of /tmp/tmp7shd760i/

|-tmp7shd760i
    |-sparkml
        |-metadata
            |-SUCCESS
            |-part-00000
        |-stages
            |-0_SQLTokenizer_182199978fae
                |-metadata
                    |-SUCCESS
                    |-part-00000

It appears that the custom Python code is missing from this /tmp directory. From the traceback, load_model appends flavor_conf["model_data"] ("sparkml") to the model URI and downloads only that subdirectory, so the sibling code directory at the model root is never downloaded, even though _add_code_from_conf_to_system_path then looks for it under the downloaded sparkml path.
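
Until a fix lands, one possible workaround (an untested sketch, reusing the run ID from above) is to download the entire model artifact, put the code directory on sys.path manually, and load the raw Spark pipeline with PipelineModel.load instead of mlflow.spark.load_model:

import os
import sys

from mlflow.tracking import MlflowClient
from pyspark.ml import PipelineModel

run_id = '6921219571d64a4dab2c0160dbca63db'

# Download the whole 'sql_model' artifact directory, which includes both
# the sparkml model data and the sibling code/ directory.
local_path = MlflowClient().download_artifacts(run_id, 'sql_model')

# Make the custom transformer importable before Spark deserializes the stage.
sys.path.insert(0, os.path.join(local_path, 'code'))

# Load the raw Spark pipeline directly from the sparkml subdirectory.
model = PipelineModel.load(os.path.join(local_path, 'sparkml'))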

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

dbczumar commented, Sep 10, 2022 (1 reaction)

@mlflow-automation @EPond89 Still working on this. I’ll have a fix out shortly. Apologies for the delay.

dbczumar commented, Oct 4, 2022 (0 reactions)

Hi @EPond89, I’ve filed https://github.com/mlflow/mlflow/pull/6968 to address this issue. Apologies for the delay.


