[BUG] Cannot load/serve Spark model with custom transformer
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
1.27.0 (also reproduced in 1.26.1)
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.4 LTS (Databricks 11.1 ML Runtime)
- Python version: 3.9.5
Describe the problem
I have a custom transformer class (SQLTokenizer) written in PySpark and run into issues when trying to load or serve a Spark model pipeline that includes this transformer. I believe I should be using the code_paths parameter in mlflow.spark.log_model() or mlflow.spark.save_model() to save the custom transformer code with the model (https://www.mlflow.org/docs/1.27.0/python_api/mlflow.spark.html#mlflow.spark.log_model). I should note that the documentation lists code_paths as a parameter, but the API page does not explain what it does.
When passing the code_paths parameter, I can see my custom code packaged inside a 'code' directory within the model artifact directory (artifact structure shown below).
However, when I open a new notebook and try to load the model, I get the following error: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp7shd760i/sparkml/code'
Side note: Is there a Databricks Runtime + MLflow version combination in which this capability is known to work?
Tracking information
No response
Code to reproduce issue
Minimal code with a single custom transformer in a Spark pipeline that can also be served:
from pyspark.ml import Pipeline
import mlflow
# Define SQLTokenizer Class inline, omitted for space but can be added if needed
sql_tokenizer = SQLTokenizer(inputCol="merged_query", outputCol="prediction")
sql_pipeline = Pipeline(stages=[sql_tokenizer])
sql_model = sql_pipeline.fit(data)  # 'data' is an existing Spark DataFrame containing a 'merged_query' column
mlflow.spark.log_model(sql_model, 'sql_model', code_paths=['/dbfs/path/to/sql_tokenizer.py'])
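For reference, a minimal sketch of what such a custom transformer might look like (this is not the actual SQLTokenizer implementation from this report; the class name and columns mirror the usage above, the tokenization logic is purely illustrative):

# Illustrative sketch only -- the real SQLTokenizer implementation is omitted above.
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class SQLTokenizer(Transformer, HasInputCol, HasOutputCol,
                   DefaultParamsReadable, DefaultParamsWritable):
    # DefaultParamsReadable/Writable enable Spark ML persistence for a Python-only transformer
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        kwargs = self._input_kwargs  # only the explicitly passed keyword args
        self._set(**kwargs)

    def _transform(self, dataset):
        # placeholder logic: split the input SQL text on whitespace
        return dataset.withColumn(self.getOutputCol(),
                                  F.split(F.col(self.getInputCol()), r"\s+"))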
Saved model structure from mlflow.spark.log_model()
|-sql_model
  |-code
    |-sql_tokenizer.py
  |-sparkml
    |-metadata
      |-SUCCESS
      |-part-00000
    |-stages
      |-0_SQLTokenizer_182199978fae
        |-metadata
          |-SUCCESS
          |-part-00000
  |-MLmodel
  |-conda.yaml
  |-python_env.yaml
  |-requirements.txt
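As a sanity check, the layout above can be confirmed by listing the run's artifacts (sketch below; the run ID matches the one used later in this report, and this is standard MlflowClient usage):

# Optional check (sketch): list the logged artifacts to verify the 'code' directory is present remotely.
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = '6921219571d64a4dab2c0160dbca63db'
for artifact in client.list_artifacts(run_id, 'sql_model'):
    print(artifact.path)
# output is expected to include: sql_model/code, sql_model/sparkml, sql_model/MLmodel, ...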
Subsequently, try to load the model from another notebook:
import mlflow
model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
custom_layer = mlflow.spark.load_model(model_path)
Stack trace
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<command-3019178> in <cell line: 5>()
3
4 model_path = 'runs:/6921219571d64a4dab2c0160dbca63db/sql_model'
----> 5 custom_layer = mlflow.spark.load_model(model_path)
6
7 display(custom_layer.transform(merged_data))
/databricks/python/lib/python3.9/site-packages/mlflow/spark.py in load_model(model_uri, dfs_tmpdir)
707 model_uri = append_to_uri_path(model_uri, flavor_conf["model_data"])
708 local_model_path = _download_artifact_from_uri(model_uri)
--> 709 _add_code_from_conf_to_system_path(local_model_path, flavor_conf)
710
711 return _load_model(model_uri=model_uri, dfs_tmpdir_base=dfs_tmpdir)
/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_from_conf_to_system_path(local_path, conf, code_key)
153 if code_key in conf and conf[code_key]:
154 code_path = os.path.join(local_path, conf[code_key])
--> 155 _add_code_to_system_path(code_path)
/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _add_code_to_system_path(code_path)
119
120 def _add_code_to_system_path(code_path):
--> 121 sys.path = [code_path] + _get_code_dirs(code_path) + sys.path
122 # Delete cached modules so they will get reloaded anew from the correct code path
123 # Otherwise python will use the cached modules
/databricks/python/lib/python3.9/site-packages/mlflow/utils/model_utils.py in _get_code_dirs(src_code_path, dst_code_path)
87 return [
88 (os.path.join(dst_code_path, x))
---> 89 for x in os.listdir(src_code_path)
90 if os.path.isdir(os.path.join(src_code_path, x)) and not x == "__pycache__"
91 ]
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp7shd760i/sparkml/code'
Inspection of /tmp/tmp7shd760i/:
|-tmp7shd760i
  |-sparkml
    |-metadata
      |-SUCCESS
      |-part-00000
    |-stages
      |-0_SQLTokenizer_182199978fae
        |-metadata
          |-SUCCESS
          |-part-00000
It appears that the custom Python code is missing from this /tmp directory. From the traceback above, this seems to be because load_model appends flavor_conf["model_data"] ("sparkml") to the model URI before downloading, so only the sparkml subdirectory is fetched; the 'code' directory is a sibling of 'sparkml' in the logged artifact, so the path <tmpdir>/sparkml/code never exists.
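A possible interim workaround (untested here; the paths below are illustrative) might be to make the custom transformer module importable yourself and load the native Spark model from the 'sparkml' subdirectory of the logged artifact, bypassing mlflow.spark.load_model:

# Hedged workaround sketch: make the custom transformer module importable, then load the
# underlying Spark PipelineModel directly from the model's 'sparkml' directory.
import sys
sys.path.insert(0, '/dbfs/path/to')  # directory containing sql_tokenizer.py (illustrative path)
import sql_tokenizer                 # verify the module resolves; Spark imports it by name on load

from pyspark.ml import PipelineModel

# point at the <model>/sparkml directory of the logged artifact (illustrative path)
sql_model = PipelineModel.load('dbfs:/path/to/sql_model/sparkml')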
Other info / logs
No response
What component(s) does this bug affect?
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
What language(s) does this bug affect?
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- Created: a year ago
- Comments: 5
@mlflow-automation @EPond89 Still working on this. I’ll have a fix out shortly. Apologies for the delay.
Hi @EPond89 , I’ve filed https://github.com/mlflow/mlflow/pull/6968 to address this issue. Apologies for the delay.