
[BUG] Cannot load/serve Spark model with custom transformer


Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

1.27.0 (also reproduced on 1.26.1)

System information

Describe the problem

I have a custom transformer class (SQLTokenizer) written in PySpark, and I run into issues when trying to load or serve a Spark model pipeline that includes this transformer. I believe I should be using the code_paths parameter of mlflow.spark.log_model() or mlflow.spark.save_model() to save the custom transformer code with the model (https://www.mlflow.org/docs/1.27.0/python_api/mlflow.spark.html#mlflow.spark.log_model). I should note that the API page lists code_paths as a parameter but gives no description of what it does.

When passing in the code_paths parameter, I can see my custom code packaged inside a ‘code’ directory within the model artifact directory (artifact structure shown below).

Subsequently, when I open a new notebook and try to load the model, I get the following error message: “FileNotFoundError: [Errno 2] No such file or directory: ‘/tmp/tmp7shd760i/sparkml/code’”

Side note: is there a Databricks Runtime + MLflow version combination on which this capability is known to work?

Tracking information

No response

Code to reproduce issue

Minimal code that puts a single custom transformer in a Spark pipeline so the resulting model can also be served:

from pyspark.ml import Pipeline
import mlflow

# Define the SQLTokenizer class inline; omitted for space, but it can be added
# if needed (a hypothetical sketch is shown after this block).
# `data` is assumed to be a Spark DataFrame with a `merged_query` column.

sql_tokenizer = SQLTokenizer(inputCol="merged_query", outputCol="prediction")
sql_pipeline = Pipeline(stages=[sql_tokenizer])
sql_model = sql_pipeline.fit(data)
mlflow.spark.log_model(sql_model, 'sql_model', code_paths=['/dbfs/path/to/sql_tokenizer.py'])
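
For reference, here is a minimal sketch of what such a transformer might look like. This is a hypothetical stand-in, not the actual SQLTokenizer: it extends Transformer with the shared input/output column params and the default read/write mixins so Spark can persist it as a pipeline stage.

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F

class SQLTokenizer(Transformer, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable):
    """Hypothetical minimal stand-in for the custom transformer."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        # keyword_only captures the constructor kwargs in _input_kwargs
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        # Toy tokenization: lowercase the input column and split on whitespace
        return dataset.withColumn(
            self.getOutputCol(),
            F.split(F.lower(F.col(self.getInputCol())), r"\s+"),
        )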

Saved model structure from mlflow.spark.log_model()

|-sql_model
    |-code
        |-sql_tokenizer.py
    |-sparkml
        |-metadata
            |-SUCCESS
            |-part-00000
        |-stages
            |-0_SQLTokenizer_182199978fae
                |-metadata
                    |-SUCCESS
                    |-part-00000
    |-MLmodel
    |-conda.yaml
    |-python_env.yaml
    |-requirements.txt
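
The MLmodel file contents weren't captured above, but based on the traceback below, the spark flavor configuration presumably looks roughly like this (a hypothetical reconstruction; version values are illustrative). Note that code is recorded relative to the model root, while model_data points at the sparkml subdirectory:

flavors:
  spark:
    code: code              # saved at the model root
    model_data: sparkml
    pyspark_version: 3.2.1  # illustrative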

Subsequently, try to load the model from another notebook:

import mlflow

model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
custom_layer = mlflow.spark.load_model(model_path)

Stack trace

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-3019178> in <cell line: 5>()
      3 
      4 model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
----> 5 custom_layer = mlflow.spark.load_model(model_path)
      6 
      7 display(custom_layer.transform(merged_data))

/databricks/python/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir)
    707     model_uri = append_to_uri_path(model_uri, flavor_conf["model_data"])
    708     local_model_path = _download_artifact_from_uri(model_uri)
--> 709     _add_code_from_conf_to_system_path(local_model_path, flavor_conf)
    710 
    711     return _load_model(model_uri=model_uri, dfs_tmpdir_base=dfs_tmpdir)

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_from_conf_to_system_path(local_path, conf, code_key)
    153     if code_key in conf and conf[code_key]:
    154         code_path = os.path.join(local_path, conf[code_key])
--> 155         _add_code_to_system_path(code_path)

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_to_system_path(code_path)
    119 
    120 def _add_code_to_system_path(code_path):
--> 121     sys.path = [code_path] + _get_code_dirs(code_path) + sys.path
    122     # Delete cached modules so they will get reloaded anew from the correct code path
    123     # Otherwise python will use the cached modules

/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _get_code_dirs(src_code_path, dst_code_path)
     87     return [
     88         (os.path.join(dst_code_path, x))
---> 89         for x in os.listdir(src_code_path)
     90         if os.path.isdir(os.path.join(src_code_path, x)) and not x == "__pycache__"
     91     ]

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp7shd760i/sparkml/code'

Inspection of /tmp/tmp7shd760i/

|-tmp7shd760i
    |-sparkml
        |-metadata
            |-SUCCESS
            |-part-00000
        |-stages
            |-0_SQLTokenizer_182199978fae
                |-metadata
                    |-SUCCESS
                    |-part-00000

It appears that the custom Python code is missing from this /tmp directory. From the traceback, load_model appends flavor_conf["model_data"] ("sparkml") to the model URI and downloads only that subdirectory, so the sibling code directory at the model root is never downloaded, even though _add_code_from_conf_to_system_path then looks for it under the downloaded sparkml path.
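
Until a fix lands, one possible workaround (an untested sketch, reusing the run ID from above) is to download the entire model artifact, put the code directory on sys.path manually, and load the raw Spark pipeline with PipelineModel.load instead of mlflow.spark.load_model:

import os
import sys

from mlflow.tracking import MlflowClient
from pyspark.ml import PipelineModel

run_id = '6921219571d64a4dab2c0160dbca63db'

# Download the whole 'sql_model' artifact directory, which includes both
# the sparkml model data and the sibling code/ directory.
local_path = MlflowClient().download_artifacts(run_id, 'sql_model')

# Make the custom transformer importable before Spark deserializes the stage.
sys.path.insert(0, os.path.join(local_path, 'code'))

# Load the raw Spark pipeline directly from the sparkml subdirectory.
model = PipelineModel.load(os.path.join(local_path, 'sparkml'))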

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

dbczumar commented, Sep 10, 2022 (1 reaction)

@mlflow-automation @EPond89 Still working on this. I’ll have a fix out shortly. Apologies for the delay.

dbczumar commented, Oct 4, 2022 (0 reactions)

Hi @EPond89, I’ve filed https://github.com/mlflow/mlflow/pull/6968 to address this issue. Apologies for the delay.


