
Develop experiment management module

See original GitHub issue

Is your feature request related to a problem? Please describe.
To record and track training experiments clearly, an experiment management module is necessary.

  1. Identify the typical user stories
  2. Identify the features we should support
  3. ~Design the module and APIs, which can easily support different backends, like MLFlow, AIM, etc.~ (a rough sketch of such an interface follows after this list)
  4. Try to apply MLFlow in the Auto3DSeg application.
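For item 3, a backend-agnostic API might look roughly like the interface below. This is only a minimal sketch; the class and method names (ExperimentManager, start_run, log_params, log_metrics, end_run) are hypothetical and not an existing MONAI API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class ExperimentManager(ABC):
    """Hypothetical backend-agnostic interface; names here are illustrative only."""

    @abstractmethod
    def start_run(self, experiment_name: str, run_name: Optional[str] = None) -> None:
        """Open a new run under the given experiment."""

    @abstractmethod
    def log_params(self, params: Dict[str, Any]) -> None:
        """Log one-time values such as hyperparameters (logged once per run)."""

    @abstractmethod
    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        """Log numerical values that change during training, keyed by step/epoch."""

    @abstractmethod
    def end_run(self) -> None:
        """Close the current run and flush any buffered records."""
```

A backend such as MLFlow or AIM would then be a subclass implementing these four calls.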

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
mingxin-zheng commented, Oct 19, 2022

Hi @Nic-Ma @binliunls @dongyang0122 @ericspod @wyli

Here are some of my thoughts about MLFlow for Auto3DSeg, from two perspectives: user experience and implementation, for the MONAI v1.1 release. Thanks!

  • User Experience:
  1. Users can log experiments on localhost or on a remote tracking server.
  2. Users can enable MLFlow in the Auto3DSeg modules AlgoGen/BundleGen.
  3. There are two new user arguments, train_local and tracking_url. If the user wants to run all trainings locally, train_local should be True and tracking_url should be set to ‘localhost’; the MLFlow server will then start locally right after BundleGen/AlgoGen. If the server is meant to be remote, a message is printed instead, and it is the user’s job to start the server on the remote machine.
  4. When the jobs are run locally, the user can continue to use algo.train() to start trainings with experiment management ON or OFF.
  5. When the jobs are dispatched remotely, the user needs to override the training command with experiment management arguments, including but not limited to enable_mlflow, tracking_url, experiment_name, params, metrics, and so on. Optionally, they can use algo._create_cmd() to see the command to run. Below are some drafts of the MLFlow-related arguments the training would take:
    • enable_mlflow: use mlflow as backend
    • tracking_url: use localhost or remote ip address for the mlflow server
    • experiment_name: required by mlflow
    • params: a set of keys to log in training (before the iterations)
    • metrics: a set of keys to log in training (during the iterations)
  • Implementation:
  1. A new base class ExperimentManager, with MLFlowExperimentManager as the only subclass in MONAI 1.1.
  2. The MLFlowExperimentManager can start the server locally and record where it keeps the database. It can print a helper message if the server is to start remotely. (Local server uses SQLite as backend?)
  3. The MLFlowExperimentManager manages experiment_name and run_name.
  4. The MLFlowExperimentManager manages a list of param names to log. About log_params in mlflow:
log_param and log_params are for logging anything that is "one-time" for each experiment run, including model parameters and other hyperparameters. An error will be thrown if the same parameter name is logged more than once in the same run.
  5. The MLFlowExperimentManager manages another list of metric names to log. About log_metrics in mlflow:
log_metric and log_metrics are for logging numerical values during training. Epoch numbers need to be specified; otherwise, MLFlow will report a conflict error.
  6. Support for pictures/text files (artifacts) is excluded from 1.1 unless we are making good progress with the other items…
  7. Finally, the names of params and metrics need to match the training variables exactly. For example, if we’re tracking max_epochs in the param buffer, the variable in train.py has to be max_epochs. It can’t be total_epochs or num_epochs.
  8. With the assumption in 7, we may iterate over all the buffer items to find the params and metrics while train.py is running. If a key matches the name of a variable, it triggers the mlflow.log_metrics or mlflow.log_params call wrapped inside the MLFlowExperimentManager (a rough sketch follows after this list).
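Putting the points above together, a rough sketch of how the proposed MLFlowExperimentManager might wrap mlflow is shown below. Only the mlflow calls (set_tracking_uri, set_experiment, start_run, log_params, log_metrics, end_run) are real library API; the class name, constructor arguments, and default values are placeholders based on the draft arguments above, and the buffer-matching logic assumes the keys mirror the variable names in train.py as described in point 7.

```python
from typing import Any, Dict, Iterable, Optional

import mlflow


class MLFlowExperimentManager:
    """Sketch only: class name, arguments, and defaults are placeholders."""

    def __init__(
        self,
        tracking_url: str = "http://localhost:5000",  # assumed local server address
        experiment_name: str = "auto3dseg",
        run_name: Optional[str] = None,
        params: Iterable[str] = ("max_epochs", "learning_rate"),
        metrics: Iterable[str] = ("train_loss", "val_mean_dice"),
    ) -> None:
        self.param_keys = set(params)    # logged once, before the iterations
        self.metric_keys = set(metrics)  # logged repeatedly, during the iterations
        mlflow.set_tracking_uri(tracking_url)
        mlflow.set_experiment(experiment_name)
        mlflow.start_run(run_name=run_name)

    def log_buffer(self, buffer: Dict[str, Any], epoch: Optional[int] = None) -> None:
        """Pick params/metrics out of a buffer whose keys mirror train.py variable names."""
        found_params = {k: v for k, v in buffer.items() if k in self.param_keys}
        found_metrics = {k: float(v) for k, v in buffer.items() if k in self.metric_keys}
        if found_params:
            # one-time values; re-logging a key with a different value raises an error
            mlflow.log_params(found_params)
        if found_metrics:
            # pass the epoch as the step so repeated logging does not conflict
            mlflow.log_metrics(found_metrics, step=epoch)

    def close(self) -> None:
        mlflow.end_run()
```

For the remote case, the user would start the tracking server themselves (e.g. with the mlflow server CLI) and pass its address as tracking_url instead of the local default.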
1 reaction
ericspod commented, Sep 2, 2022

I’m starting to write bundles which choose a new output directory every time the training script is invoked, so that runs get placed in unique locations. I want to direct the loggers to a log file in that directory, but it would also be good to write out the current configuration the bundle is using, so that one can see what was changed from one run to the next. This won’t include any auxiliary code the bundle uses, but it would go most of the way toward keeping track of the environment that generated the data in that directory. This is also lighter weight than tools like mlflow and would suit environments where those can’t be used.
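A minimal sketch of that lighter-weight approach, assuming the run configuration is available as a plain dict; the function name and directory layout here are illustrative, not the actual bundle code:

```python
import json
import logging
from datetime import datetime
from pathlib import Path


def prepare_run_dir(base_dir: str, config: dict) -> Path:
    """Create a unique output directory, attach a file logger, and snapshot the config."""
    run_dir = Path(base_dir) / datetime.now().strftime("run_%Y%m%d_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    # send all log records for this run to a file inside the run directory
    handler = logging.FileHandler(run_dir / "training.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logging.getLogger().addHandler(handler)

    # record the configuration used, so runs can be diffed afterwards
    with open(run_dir / "config.json", "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)

    return run_dir
```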
