Use mlflow for better versioning and collaboration
TL;DR: The plugin is in active development here and is available on PyPI. It already works reliably with kedro>=0.16.0, but it is slightly different from (and much more complete than) what is described in the issue below. Feel free to try it out and give feedback. The plugin enforces Kedro design principles when integrating mlflow (strict separation of I/O and compute, external configuration, data abstraction, CLI wrappers…) to avoid breaking the Kedro experience when using mlflow, and to facilitate versioning and model serving.
A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration. Data abstraction is a really important feature. However, we have a major disagreement with how data versioning is implemented in kedro. We decided to move on and develop our own versioning layer on top of your framework.
I'd be glad to discuss some of the architecture / design choices behind it with the kedro developers, and that is the goal of this issue.
Context
Versioning in machine learning is something very specific: you want to version a run, i.e. the execution of code on data with parameters. Versioning data alone is likely to be useless for reproducibility in the future.
Databricks recently released mlflow, which is intended to address this very goal. I think that it would be beneficial for kedro to build on top of what mlflow has already created in order to:
- offer a better versioning system
- facilitate collaboration between data scientists to “share” models
- not reinvent the wheel
Description
The current internal versioning method in kedro does not intend to version a full "run" (code + data + parameters), which makes it less useful for machine learning. Switching to mlflow for this would be a quick win to improve the framework.
Possible Implementation
My team has implemented several features:
1. Implement a configuration file for mlflow (an `mlflow.yml` file in the `conf/base` folder) which makes it possible to parameterize all mlflow features through a conf file (autologging parameters, tracking URI, experiment where the run should be stored…) and which is added to the project template. This is really useful since we use a "local" mlflow server where each data scientist can experiment, and a shared one for shareable models and runs, and it is nice to switch between them through a config file (a sketch of such a config follows this list).
2. Create an `MlflowDataset` class (similar to the `AbstractVersionedDataset` class) which makes it possible to decide whether a dataset should be logged as an mlflow artifact: the `versioned` parameter in `catalog.yml` is replaced by a `use_mlflow: true` flag that you can pass to any dataset, and the dataset is then logged as an mlflow artifact automatically. As a best practice, we consider that we should version only datasets that are fitted on data (e.g. encoders, binarizers, machine learning models…).
3. Each time `run_node` is called, the parameters used in the node are logged as mlflow parameters (through `mlflow.log_params`). This is customizable in the `mlflow.yml` conf file (also sketched below).
4. Implement a CLI command, `kedro pull --run-id MLFLOW_RUN_ID`, that retrieves the data of an mlflow run and copies it into your `data` folder. This is really convenient to share a run with coworkers (especially since we can also retrieve the commit sha from mlflow and check out the exact same code). This `pull` command also pulls parameters and writes them to an `mlflow_parameters.yml` file. It warns you about conflicts (parameters that exist both in your local conf and in the mlflow run you have just pulled) and lets you select by hand which one you want to keep. (To make `kedro pull` work, we also log some configuration files as artifacts during `kedro run`, including the catalog and the parameters, but this is purely technical.) A sketch of the underlying call is also given below.
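For illustration, here is a rough sketch of how such an `mlflow.yml` could be read and applied at the start of a run (the keys and file layout are our own convention, not an official kedro or mlflow format):

```python
# conf/base/mlflow.yml (hypothetical layout):
#   tracking_uri: http://our-shared-mlflow-server:5000
#   experiment_name: my_project
#   autologging: true
import mlflow
import yaml


def configure_mlflow(conf_path="conf/base/mlflow.yml"):
    """Read the project-level mlflow configuration and apply it."""
    with open(conf_path) as f:
        conf = yaml.safe_load(f)
    mlflow.set_tracking_uri(conf["tracking_uri"])
    mlflow.set_experiment(conf["experiment_name"])
    if conf.get("autologging"):
        mlflow.autolog()  # available in recent mlflow versions
    return conf
```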
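The parameter logging of point 3 was originally done by patching `run_node` directly; with kedro >= 0.16 the same idea can be sketched as a hook (the class below is illustrative, not the plugin's actual code):

```python
import mlflow
from kedro.framework.hooks import hook_impl


class MlflowParamsHook:  # hypothetical name
    """Log every `params:`-prefixed node input as an mlflow parameter."""

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async):
        params = {
            name.replace("params:", ""): value
            for name, value in inputs.items()
            if name.startswith("params:")
        }
        if params:
            mlflow.log_params(params)
```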
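And a rough sketch of what the `pull` command of point 4 does under the hood with mlflow's tracking client (the function name and defaults are made up):

```python
from mlflow.tracking import MlflowClient


def pull_run(run_id, dst_folder="data"):
    """Download the artifacts of an mlflow run into the local data folder."""
    client = MlflowClient()
    local_path = client.download_artifacts(run_id, "", dst_path=dst_folder)
    run = client.get_run(run_id)
    # mlflow records the commit sha of the run, so coworkers can check out
    # the exact same code before restoring the data.
    commit_sha = run.data.tags.get("mlflow.source.git.commit")
    return local_path, run.data.params, commit_sha
```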
General thoughts about the feature
I would love to hear the kedro developers' thoughts:
- Are these functionalities desirable?
- If yes, are our design choices the best way to implement these features?
I can understand that developers want kedro to be "self contained" and not rely on a third party application. However, I think it is definitely not a good idea to reinvent the wheel. Besides, such a change would not be harmful to kedro users:
- If they don’t want to version their dataset, it does not change anything
- If they don't want to create an mlflow server, you can just add an `mlruns` folder in the kedro project that gathers the data versioned by mlflow (mlflow can store data locally, even if it is not meant to go through an mlflow server). AFAIK, this is really similar to what is currently done with kedro versioning.
I think it is a good way to get the "best of both worlds" (mlflow offers configuration through an MLProject file, which overlaps with kedro's configuration and is less flexible AFAIK, so I'd rather stick to kedro for this).
Top GitHub Comments
Hello @yetudada, many thanks for the reply. I was quite busy at work recently but I will definitely try to make a kedro-mlflow plugin by the end of the year.
Some comments on the different points you raised:
I've seen the new `Journal` feature in the development branch, but we definitely want to stick to mlflow for versioning because we also use it to serve the model and to monitor the app.

a. Actually, creating a plugin was our first idea, but since we made many modifications to the package to address other specific concerns (especially integration with our internal CI/CD), it was quicker to modify the package directly. I will try to develop a `kedro-mlflow` plugin by the end of the year if I have enough time to do so.

b. Functionality 3 cannot be implemented as a node decorator (this is what we tried in our first sprint). Indeed, there are three things to map for a variable: its name in the catalog, the name of the function argument it is passed to, and its value. In the following code snippet, we need to access the `inputs` dictionary:

https://github.com/quantumblacklabs/kedro/blob/e332e2e63f89621da01507b1c1de4c9d644f3ee3/kedro/runner/runner.py#L169-L184

but inside a decorator I can only access the `kwargs`, which no longer contain the catalog names (which are the ones I want to log):

https://github.com/quantumblacklabs/kedro/blob/847aa0f4d419a167608b4fb675ea347c7a617bcf/kedro/pipeline/node.py#L460-L461

I do not see how I can log the inputs without access to `run_node` (but I am open to any less hacky solution).
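To make the limitation concrete, here is a minimal sketch (the function and dataset names are made up):

```python
# Minimal illustration (all names are made up).
def log_params_decorator(func):
    """A decorator only sees the function's own argument names."""
    def wrapper(*args, **kwargs):
        # Only "options" and "data" are visible here; the catalog names
        # ("params:model_options", "master_table") were dropped when the node
        # re-keyed the loaded data onto the function's signature.
        print("visible names:", sorted(kwargs))
        return func(*args, **kwargs)
    return wrapper


@log_params_decorator
def train_model(options, data):
    return options, data


# run_node-level view, keyed by catalog names -- this is what I want to log:
#   {"params:model_options": {"max_depth": 5}, "master_table": <DataFrame>}
train_model(options={"max_depth": 5}, data="fake master table")
```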
Thanks for the support! I'll ping @tolomea when I have a first version of this plugin to discuss architecture concerns.
I work in a huge bank, but the opinions I express here cannot be considered those of my employer 😃 Using kedro is internal to my team (and a few others AFAIK) and is far from being an official standard.
Btw, I've seen that most of the features you released in 0.15.2 are very consistent with this discussion (the possibility to `kedro run --load-version` is very similar to what is described in my `kedro mlflow pull --run-id RUN_ID` command, and the ability to create modular pipelines is very useful to create a custom "mlflow flavour" for prediction, which is very hacky in our current implementation).

Hello @yetudada, sorry for not coming back here for a while, I was quite busy at work.
Some news and feedback:

The `Journal` is quite an interesting feature (it makes runs more reproducible than before, with detailed information), but I find it (this is a personal feeling, no offense intended) almost useless without a user interface to browse the different runs and find which one I want to keep / reuse. Mlflow offers this user interface, and that is why my team decided to stick to it. Besides, mlflow makes it possible to log metrics / artifacts with the run, which makes runs much more "searchable" (you can easily filter / retrieve a run with specific features, which does not seem easy with the `Journal`).

A contrib `transformer` may be the most "kedro compatible" way to do it, but it forces the user to modify their `ProjectContext` to decide which elements must be logged in mlflow. We do not want this because it creates a lot of configuration back and forth between the `ProjectContext` in the `run.py` file and the `catalog.yml`, which is not very user-friendly and very likely error prone. The solution we came up with is to wrap the `save` methods of `Dataset`s on the fly (which allows configuring everything in the `catalog.yml` only), but it is quite a dangerous solution because it hides the behaviour from the user (a rough sketch is at the end of this comment). We haven't decided on the best solution yet.

Many thanks, I've read the article and I found it very interesting. Thanks for the credit!
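For reference, the "wrap on the fly" approach looks roughly like this (the function name is made up, and the reliance on the dataset's internal `_filepath` attribute is part of what makes it feel dangerous):

```python
import functools

import mlflow


def wrap_save_with_mlflow(dataset, dataset_name):
    """Patch a dataset instance so that every save() also logs an mlflow artifact."""
    original_save = dataset.save

    @functools.wraps(original_save)
    def save_and_log(data):
        original_save(data)
        # Most file-based kedro datasets store their local path in `_filepath`;
        # relying on this internal attribute is part of what makes this hacky.
        mlflow.log_artifact(str(dataset._filepath), artifact_path=dataset_name)

    dataset.save = save_and_log
    return dataset
```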
We still use our customisation extensively. I've read the discussion in #219 and here are a few thoughts:
Pros for keeping most of the logic in `ProjectContext`:

a. It makes it possible to handle very specific situations at the project level, which are not intended to be generic.
b. I personally find the class very easy to extend.
Cons for keeping most of the logic in `ProjectContext`:

a. Currently, I extend the context by inheriting from `KedroContext` and then making the `ProjectContext` inherit from my custom class. One major drawback of this approach is that it is difficult to compose two different pieces of logic, even if they do not interfere with each other. **Example:** imagine that I have created some mlflow-specific logic and also some Spark logic, along these lines:
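(The snippet below is a simplified sketch; the import path and signatures depend on the kedro version, and the bodies are made up for illustration.)

```python
from kedro.context import KedroContext  # kedro 0.15.x import path
import mlflow


class MlflowContext(KedroContext):
    """Made-up example: configure mlflow around each run."""

    def run(self, *args, **kwargs):
        mlflow.set_experiment("my_project")  # illustrative only
        return super().run(*args, **kwargs)


class SparkContext(KedroContext):
    """Made-up example: initialise a SparkSession for the project."""

    def run(self, *args, **kwargs):
        # e.g. SparkSession.builder.getOrCreate() plus project-specific conf
        return super().run(*args, **kwargs)
```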
Each context is fine on its own, but I cannot (easily) inherit from both `MlflowContext` and `SparkContext`. The solution I often use is to define a priority and make either `MlflowContext` or `SparkContext` inherit from the other, but it is not very satisfying.

b. Currently, some template-related methods (e.g. the ones that need to know the name of the package in the template, like https://github.com/quantumblacklabs/kedro/blob/57cf26a4ae9f11e942cd630dbb4dda71e1edf034/kedro/template/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/run.py#L47-L48 which needs to import `create_pipelines` based on the template name) must be written in `ProjectContext` and not in a parent class, which is not very user friendly and makes portability more difficult.

Conclusion: I have never used `pluggy` before and I may be wrong, but I had a quick glance at the documentation and it seems able to overcome these shortcomings, which is IMHO another step forward in the right direction. I'd be glad to see what's coming next in kedro. Once again, I think that plugin management in kedro is already fantastic and makes it possible to customize it both easily and deeply. Thanks for the amazing job!