[BUG] Bug in extra added code paths when getting code directories from model

Thank you for submitting an issue. Please refer to our issue policy for additional information about bug reports. For help with debugging your code, please refer to Stack Overflow.

Please fill in this bug report template to ensure a timely and thorough response.

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS Catalina
  • MLflow installed from (source or binary): Installed with pip
  • MLflow version (run mlflow --version): 1.25.1 (happens on 1.24.0 as well)
  • Python version: 3.9.5
  • npm version, if running the dev UI: None
  • Exact command to reproduce: Any load_model call on a model that was saved with the code_paths argument reproduces it:
mlflow.sklearn.save_model(model, model_path, serialization_format="cloudpickle", code_paths=extra_code_paths)  # the code_paths parameter is only available from 1.25.0 onward
mlflow.sklearn.load_model(model_path)

Describe the problem

The code_paths argument lets you bundle local code that is not available from a public library (such as an internal library) with a saved model. The listed directories are copied into the MLflow model directory under a dedicated code directory and recorded in the relevant flavor configurations for future loading. When the model is loaded, the paths inside the code directory are added to the system path so that imports of these files work out of the box. Unfortunately, the code in charge of retrieving the relevant directories under code has a small bug that prevents this from working.

For example, suppose my sklearn model depends (say, as a preprocessor) on a local file inside a private library, importable as private_module.private_code_file. I add the directory that contains private_module as a code path so that this import works when the model is loaded. However, I receive ModuleNotFoundError: No module named 'private_module' instead.

This is because the relevant function _get_code_dirs has a small bug in this line:

return [
    (os.path.join(dst_code_path, x))
    for x in os.listdir(src_code_path)
    if os.path.isdir(x) and not x == "__pycache__"
]

Specifically, the error is at

if os.path.isdir(x) and not x == "__pycache__"

This checks whether x is a directory, but x is only the bare name of an entry under the code directory, so os.path.isdir(x) is resolved relative to the current working directory rather than src_code_path. Instead, we should check that the full path is a directory:

if os.path.isdir(os.path.join(src_code_path, x)) and not x == "__pycache__"
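The difference between the two checks can be reproduced without MLflow. The following is a minimal sketch using only the standard library; list_code_dirs_buggy and list_code_dirs_fixed are hypothetical stand-ins modeled on the snippet above (not the actual MLflow function), and the default for dst_code_path is an assumption made for illustration.

```python
import os
import tempfile


def list_code_dirs_buggy(src_code_path, dst_code_path=None):
    """Mirrors the quoted snippet: isdir() receives only the bare entry name."""
    dst_code_path = dst_code_path or src_code_path  # assumed default
    return [
        os.path.join(dst_code_path, x)
        for x in os.listdir(src_code_path)
        if os.path.isdir(x) and not x == "__pycache__"
    ]


def list_code_dirs_fixed(src_code_path, dst_code_path=None):
    """Proposed fix: isdir() receives the full path of the entry."""
    dst_code_path = dst_code_path or src_code_path  # assumed default
    return [
        os.path.join(dst_code_path, x)
        for x in os.listdir(src_code_path)
        if os.path.isdir(os.path.join(src_code_path, x)) and x != "__pycache__"
    ]


def demo():
    """Return (buggy result, names found by the fixed version)."""
    with tempfile.TemporaryDirectory() as code_dir, \
            tempfile.TemporaryDirectory() as elsewhere:
        # Hypothetical layout: one real code directory, one cache directory, one file.
        os.mkdir(os.path.join(code_dir, "private_lib"))
        os.mkdir(os.path.join(code_dir, "__pycache__"))
        open(os.path.join(code_dir, "notes.txt"), "w").close()

        # Work from an unrelated, empty directory so bare names cannot resolve.
        original_cwd = os.getcwd()
        os.chdir(elsewhere)
        try:
            buggy = list_code_dirs_buggy(code_dir)
            fixed = list_code_dirs_fixed(code_dir)
        finally:
            os.chdir(original_cwd)
        return buggy, [os.path.basename(p) for p in fixed]


if __name__ == "__main__":
    print(demo())
```

Note that the buggy version only misbehaves when the current working directory differs from the code directory, which is the usual situation when a model is loaded.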

Code to reproduce issue

mlflow.sklearn.save_model(model, model_path, serialization_format="cloudpickle", code_paths=["path/to/directory/containing/private_module"])  # the code_paths parameter is only available from 1.25.0 onward
mlflow.sklearn.load_model(model_path)
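To see why a skipped subdirectory leads to the ModuleNotFoundError, here is a rough standard-library sketch of the mechanism described in the problem statement. It assumes the copied code ends up as code/private_lib/private_module/private_code_file.py, where private_lib is a hypothetical name for the directory passed in code_paths; it does not call MLflow and is not its actual loading code.

```python
import importlib
import os
import sys
import tempfile


def make_fake_model_dir(root):
    """Create <root>/code/private_lib/private_module/private_code_file.py."""
    package_dir = os.path.join(root, "code", "private_lib", "private_module")
    os.makedirs(package_dir)
    open(os.path.join(package_dir, "__init__.py"), "w").close()
    with open(os.path.join(package_dir, "private_code_file.py"), "w") as f:
        f.write("ANSWER = 42\n")
    return os.path.join(root, "code")


def can_import(name):
    """Return True if the module imports, False on ModuleNotFoundError."""
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


def demo():
    """Return (importable with only code/, importable with code/private_lib too)."""
    target = "private_module.private_code_file"
    with tempfile.TemporaryDirectory() as root:
        code_dir = make_fake_model_dir(root)
        sub_dir = os.path.join(code_dir, "private_lib")
        added = []
        try:
            # Only the top-level code directory is on the path (the buggy outcome).
            sys.path.insert(0, code_dir)
            added.append(code_dir)
            before = can_import(target)

            # The subdirectory is on the path as well (the intended outcome).
            sys.path.insert(0, sub_dir)
            added.append(sub_dir)
            after = can_import(target)
        finally:
            # Undo the changes to interpreter state made by this sketch.
            for path in added:
                sys.path.remove(path)
            for name in ("private_module", target):
                sys.modules.pop(name, None)
        return before, after


if __name__ == "__main__":
    print(demo())
```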

Other info / logs

This happens whenever the model is loaded:

  • When loading the model using load_model(model_path)
  • When trying to serve it using mlflow models serve <model_path>
  • When testing it in a SageMaker container (built using mlflow sagemaker build-and-push-container --no-push) and then running mlflow sagemaker run-local <model_path>

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
dbczumar commented, Apr 29, 2022

@zionsofer Thank you for pointing this out and helping us root cause the issue. We’re working on a fix right now.

0 reactions
zionsofer commented, Nov 2, 2022

@jhallard I’ll admit I haven’t been working with this part of MLflow lately, but IIRC it’s because the copied directories could be the parents of such libraries, so you would have to include those as well.

You should ask the maintainers though, as your question is indeed valid
