Use mlflow for better versioning and collaboration
TL;DR: The plugin is in active development here and is available on PyPI. It already works reliably with kedro>=0.16.0, but it is slightly different from (and much more complete than) what is described in the issue below. Feel free to try it out and give feedback. The plugin enforces Kedro design principles when integrating mlflow (strict separation of I/O and compute, external configuration, data abstraction, CLI wrappers…) to avoid breaking the Kedro experience when using mlflow, and to facilitate versioning and model serving.
A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration. Data abstraction is a really important feature. However, we have a major disagreement with how data versioning is implemented in kedro. We decided to move on and develop our own versioning layer on top of your framework.
I'd be glad to discuss some of the architecture / design choices behind it with the kedro developers, and that is the goal of this issue.
Context
Versioning in machine learning is something very specific: you want to version a run, i.e. the execution of code on data with parameters. Versioning data alone is likely to be useless for reproducibility in the future.
Databricks recently released mlflow, which is intended to address this very goal. I think that it would be beneficial for kedro to build on top of what mlflow has already created in order to:
- offer a better versioning system
- facilitate collaboration between data scientists to “share” models
- not reinvent the wheel
Description
The current internal versioning method in kedro does not intend to version a full "run" (code + data + parameters), which makes it less useful for machine learning. Switching to mlflow for this would be a quick win to improve the framework.
Possible Implementation
My team has implemented several features:
1. Implement a configuration file for mlflow (an `mlflow.yml` file in the `conf/base` folder) which makes it possible to parameterize all mlflow features through a conf file (autologging parameters, tracking URI, experiment where the run should be stored…) and which is added to the project template. This is really useful since we use a "local" mlflow server where each data scientist can experiment, and a shared one for shareable models and runs, and it is nice to switch between them through a config file (a sketch of such a config follows this list).
2. Create an `MlflowDataset` class (similar to the `AbstractVersionedDataset` class) which makes it possible to decide whether a dataset should be logged as an mlflow artifact: the `versioned` parameter in `catalog.yml` is replaced by a `use_mlflow: true` flag that you can pass to any dataset, and the dataset is then logged as an mlflow artifact automatically. As a best practice, we consider that we should version only datasets that are fitted on data (e.g. encoders, binarizers, machine learning models…).
3. Each time `run_node` is called, the parameters used in the node are logged as mlflow parameters (through `mlflow.log_params`). This is customizable in the `mlflow.yml` conf file (also sketched below).
4. Implement a CLI command, `kedro pull --run-id MLFLOW_RUN_ID`, that retrieves the data of an mlflow run and copies it into your `data` folder. This is really convenient to share a run with coworkers (especially since we can also retrieve the commit sha from mlflow and check out the exact same code). This `pull` command also pulls parameters and writes them to an `mlflow_parameters.yml` file. It warns you about conflicts (parameters that exist both in your local conf and in the mlflow run you have just pulled) and lets you select by hand which one you want to keep. (To make `kedro pull` work, we also log some configuration files as artifacts during `kedro run`, including the catalog and the parameters, but this is purely technical.) A sketch of the underlying call is also given below.
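For illustration, here is a rough sketch of how such an `mlflow.yml` could be read and applied at the start of a run (the keys and file layout are our own convention, not an official kedro or mlflow format):

```python
# conf/base/mlflow.yml (hypothetical layout):
#   tracking_uri: http://our-shared-mlflow-server:5000
#   experiment_name: my_project
#   autologging: true
import mlflow
import yaml


def configure_mlflow(conf_path="conf/base/mlflow.yml"):
    """Read the project-level mlflow configuration and apply it."""
    with open(conf_path) as f:
        conf = yaml.safe_load(f)
    mlflow.set_tracking_uri(conf["tracking_uri"])
    mlflow.set_experiment(conf["experiment_name"])
    if conf.get("autologging"):
        mlflow.autolog()  # available in recent mlflow versions
    return conf
```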
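The parameter logging of point 3 was originally done by patching `run_node` directly; with kedro >= 0.16 the same idea can be sketched as a hook (the class below is illustrative, not the plugin's actual code):

```python
import mlflow
from kedro.framework.hooks import hook_impl


class MlflowParamsHook:  # hypothetical name
    """Log every `params:`-prefixed node input as an mlflow parameter."""

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async):
        params = {
            name.replace("params:", ""): value
            for name, value in inputs.items()
            if name.startswith("params:")
        }
        if params:
            mlflow.log_params(params)
```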
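And a rough sketch of what the `pull` command of point 4 does under the hood with mlflow's tracking client (the function name and defaults are made up):

```python
from mlflow.tracking import MlflowClient


def pull_run(run_id, dst_folder="data"):
    """Download the artifacts of an mlflow run into the local data folder."""
    client = MlflowClient()
    local_path = client.download_artifacts(run_id, "", dst_path=dst_folder)
    run = client.get_run(run_id)
    # mlflow records the commit sha of the run, so coworkers can check out
    # the exact same code before restoring the data.
    commit_sha = run.data.tags.get("mlflow.source.git.commit")
    return local_path, run.data.params, commit_sha
```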
General thoughts about the feature
I would love to hear the kedro developers' thoughts:
- Are these functionalities desirable?
- If yes, are our design choices the best way to implement these features?
I can understand that developers want kedro to be "self contained" and not rely on a third party application. However, I think it is definitely not a good idea to reinvent the wheel. Besides, such a change would not be harmful to kedro users:
- If they don’t want to version their dataset, it does not change anything
- If they don't want to create an mlflow server, you can just add an `mlruns` folder in the kedro project that gathers the data versioned by mlflow (mlflow can store data locally, even if it is not meant to go through an mlflow server). AFAIK, this is really similar to what is currently done with kedro versioning.
I think it is a good way to get the "best of both worlds" (mlflow offers configuration through an MLProject file, which overlaps with kedro's configuration and is less flexible AFAIK, so I'd rather stick to kedro for this).
Top GitHub Comments
Hello @yetudada, many thanks for the reply. I was quite busy at work recently but I will definitely try to make a kedro-mlflow plugin by the end of the year.
Some comments on the different points you raised:
I've seen the new `Journal` feature in the development branch, but we definitely want to stick to mlflow for versioning because we also use it to serve the model and to monitor the app.

a. Actually, creating a plugin was our first idea, but since we made many modifications to the package to address other specific concerns (especially integration with our internal CI/CD), it was quicker to modify the package directly. I will try to develop a `kedro-mlflow` plugin by the end of the year if I have enough time to do so.

b. Functionality 3 cannot be implemented as a node decorator (this is what we tried in our first sprint). Indeed, there are three things to map for a variable: its name in the catalog, the name of the function argument it is passed to, and its value. In the following code snippet, we need to access the `inputs` dictionary:

https://github.com/quantumblacklabs/kedro/blob/e332e2e63f89621da01507b1c1de4c9d644f3ee3/kedro/runner/runner.py#L169-L184

but inside a decorator I can only access the `kwargs`, which no longer contain the catalog names (which are the ones I want to log):

https://github.com/quantumblacklabs/kedro/blob/847aa0f4d419a167608b4fb675ea347c7a617bcf/kedro/pipeline/node.py#L460-L461

I do not see how I can log the inputs without access to `run_node` (but I am open to any less hacky solution).
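To make the limitation concrete, here is a minimal sketch (the function and dataset names are made up):

```python
# Minimal illustration (all names are made up).
def log_params_decorator(func):
    """A decorator only sees the function's own argument names."""
    def wrapper(*args, **kwargs):
        # Only "options" and "data" are visible here; the catalog names
        # ("params:model_options", "master_table") were dropped when the node
        # re-keyed the loaded data onto the function's signature.
        print("visible names:", sorted(kwargs))
        return func(*args, **kwargs)
    return wrapper


@log_params_decorator
def train_model(options, data):
    return options, data


# run_node-level view, keyed by catalog names -- this is what I want to log:
#   {"params:model_options": {"max_depth": 5}, "master_table": <DataFrame>}
train_model(options={"max_depth": 5}, data="fake master table")
```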
Thanks for the support! I'll ping @tolomea when I have a first version of this plugin to discuss architecture concerns.
I work in a huge bank, but the opinions I express here cannot be considered those of my employer 😃 Using kedro is internal to my team (and a few others AFAIK) and is far from being an official standard.
Btw, I've seen that most of the features you released in 0.15.2 are very consistent with this discussion (the possibility to `kedro run --load-version` is very similar to what is described in my `kedro mlflow pull --run-id RUN_ID` command, and the ability to create modular pipelines is very useful to create a custom "mlflow flavour" for prediction, which is very hacky in our current implementation).

Hello @yetudada, sorry for not coming back here for a while, I was quite busy at work.
Some news and feedback:

The `Journal` is quite an interesting feature (it makes runs more reproducible than before, with detailed information), but I find it (this is a personal feeling, no offense intended) almost useless without a user interface to browse the different runs and find which one I want to keep / reuse. Mlflow offers this user interface, and that is why my team decided to stick to it. Besides, mlflow makes it possible to log metrics / artifacts with the run, which makes runs much more "searchable" (you can easily filter / retrieve a run with specific features, which does not seem easy with the `Journal`).

A contrib `transformer` may be the most "kedro compatible" way to do it, but it forces the user to modify their `ProjectContext` to decide which elements must be logged in mlflow. We do not want this because it creates a lot of configuration back and forth between the `ProjectContext` in the `run.py` file and the `catalog.yml`, which is not very user-friendly and very likely error prone. The solution we came up with is to wrap the `save` methods of `Dataset`s on the fly (which allows configuring everything in the `catalog.yml` only), but it is quite a dangerous solution because it hides the behaviour from the user (a rough sketch is at the end of this comment). We haven't decided on the best solution yet.

Many thanks, I've read the article and I found it very interesting. Thanks for the credit!
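For reference, the "wrap on the fly" approach looks roughly like this (the function name is made up, and the reliance on the dataset's internal `_filepath` attribute is part of what makes it feel dangerous):

```python
import functools

import mlflow


def wrap_save_with_mlflow(dataset, dataset_name):
    """Patch a dataset instance so that every save() also logs an mlflow artifact."""
    original_save = dataset.save

    @functools.wraps(original_save)
    def save_and_log(data):
        original_save(data)
        # Most file-based kedro datasets store their local path in `_filepath`;
        # relying on this internal attribute is part of what makes this hacky.
        mlflow.log_artifact(str(dataset._filepath), artifact_path=dataset_name)

    dataset.save = save_and_log
    return dataset
```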
We still use our customisation extensively. I've read the discussion in #219 and here are a few thoughts:
Pros for keeping most of the logic in `ProjectContext`:

a. It makes it possible to handle very specific situations at the project level, which are not intended to be generic.
b. I personally find the class very easy to extend.
Cons for keeping most of the logic in `ProjectContext`:

a. Currently, I extend the context by inheriting from `KedroContext` and then making the `ProjectContext` inherit from my custom class. One major drawback of this approach is that it is difficult to compose two different pieces of logic, even if they do not interfere with each other. **Example:** imagine that I have created some mlflow-specific logic and also some Spark logic, along these lines:
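(The snippet below is a simplified sketch; the import path and signatures depend on the kedro version, and the bodies are made up for illustration.)

```python
from kedro.context import KedroContext  # kedro 0.15.x import path
import mlflow


class MlflowContext(KedroContext):
    """Made-up example: configure mlflow around each run."""

    def run(self, *args, **kwargs):
        mlflow.set_experiment("my_project")  # illustrative only
        return super().run(*args, **kwargs)


class SparkContext(KedroContext):
    """Made-up example: initialise a SparkSession for the project."""

    def run(self, *args, **kwargs):
        # e.g. SparkSession.builder.getOrCreate() plus project-specific conf
        return super().run(*args, **kwargs)
```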
Each context is fine on its own, but I cannot (easily) inherit from both `MlflowContext` and `SparkContext`. The solution I often use is to define a priority and make either `MlflowContext` or `SparkContext` inherit from the other, but it is not very satisfying.

b. Currently, some template-related methods (e.g. the ones that need to know the name of the package in the template, like https://github.com/quantumblacklabs/kedro/blob/57cf26a4ae9f11e942cd630dbb4dda71e1edf034/kedro/template/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/run.py#L47-L48 which needs to import `create_pipelines` based on the template name) must be written in `ProjectContext` and not in a parent class, which is not very user friendly and makes portability more difficult.

Conclusion: I have never used `pluggy` before and I may be wrong, but I had a quick glance at the documentation and it seems able to overcome these shortcomings, which is IMHO another step forward in the right direction. I'd be glad to see what's coming next in kedro. Once again, I think that plugin management in kedro is already fantastic and makes it possible to customize it both easily and deeply. Thanks for the amazing job!