[FR] SqlAlchemyStore.log_batch should really write in batches
MLflow Roadmap Item
This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers. We're seeking help with the implementation of roadmap items tagged with the "help wanted" label.
For requirements clarifications and implementation questions, or to request a PR review, please tag @WeichenXu123 in your communications related to this issue.
Proposal Summary
Since logging to MLflow can easily become very slow, it would be great if the log_batch method actually wrote in batches. Currently the method only iterates over all metrics, params, and tags and calls log_metric/log_param/set_tag on each one. To my understanding, each of these calls issues its own database round trip, which is very slow. It seriously slows down my training loop when I log the losses in each batch of each epoch.
Motivation
See the following snippet from sqlalchemy_store.py:
def log_batch(self, run_id, metrics, params, tags):
    _validate_run_id(run_id)
    _validate_batch_log_data(metrics, params, tags)
    _validate_batch_log_limits(metrics, params, tags)
    with self.ManagedSessionMaker() as session:
        run = self._get_run(run_uuid=run_id, session=session)
        self._check_run_is_active(run)
        try:
            for param in params:
                self.log_param(run_id, param)
            for metric in metrics:
                self.log_metric(run_id, metric)
            for tag in tags:
                self.set_tag(run_id, tag)
        except MlflowException as e:
            raise e
        except Exception as e:
            raise MlflowException(e, INTERNAL_ERROR)
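For illustration, here is a minimal sketch of what a batched implementation could look like: it builds all ORM objects up front and stages them in the already-open session with a single add_all call, so the INSERTs are flushed together in one transaction. The SqlParam, SqlMetric, and SqlTag classes and their columns are assumptions based on MLflow's tracking schema and may differ between versions, and the per-entity validation and duplicate handling performed inside log_param, log_metric, and set_tag is omitted here.

def log_batch(self, run_id, metrics, params, tags):
    _validate_run_id(run_id)
    _validate_batch_log_data(metrics, params, tags)
    _validate_batch_log_limits(metrics, params, tags)
    with self.ManagedSessionMaker() as session:
        run = self._get_run(run_uuid=run_id, session=session)
        self._check_run_is_active(run)
        try:
            # Build all ORM objects first, then hand them to the session in one
            # call so SQLAlchemy flushes the INSERTs together instead of issuing
            # one round trip per logged entity.
            session.add_all(
                [SqlParam(run_uuid=run_id, key=p.key, value=p.value) for p in params]
                + [
                    SqlMetric(
                        run_uuid=run_id,
                        key=m.key,
                        value=m.value,
                        timestamp=m.timestamp,
                        step=m.step,
                    )
                    for m in metrics
                ]
                + [SqlTag(run_uuid=run_id, key=t.key, value=t.value) for t in tags]
            )
        except MlflowException as e:
            raise e
        except Exception as e:
            raise MlflowException(e, INTERNAL_ERROR)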
- What is the use case for this feature? Imagine you are doing a hyperparameter search (e.g. using hyperopt or optuna) and you want to log each of the trials in detail. In my example I am training 20 models in parallel on 4 GPUs and I want to log the losses. When using the default file storage (mlruns folder), logging works very well, but the UI gets extremely slow. Therefore I am switching to a database backend, which makes the UI really fast but unfortunately makes the logging very slow. (A sketch of such a training loop follows this list.)
- Why is this use case valuable to support for MLflow users in general? Performance is always important, and I have already seen many questions about performance issues in MLflow.
- Why is this use case valuable to support for your project(s) or organization? Same as above.
- Why is it currently difficult to achieve this use case? (please be as specific as possible about why related MLflow features and components are insufficient) Logging slows down my training loop a lot. Logging should be highly optimized, and the number of DB transactions should be as low as possible.
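For concreteness, a hedged sketch of the use case above. The tracking URI, num_epochs, train_loader, and train_step are placeholders; each mlflow.log_metrics call is routed through the tracking client to the store's log_batch, which is why per-entity round trips there dominate a tight training loop.

import mlflow

mlflow.set_tracking_uri("postgresql://user:password@localhost/mlflow")  # placeholder URI

with mlflow.start_run():
    for epoch in range(num_epochs):                  # num_epochs: placeholder
        for i, batch in enumerate(train_loader):     # train_loader: placeholder
            loss = train_step(batch)                 # train_step: placeholder
            # On a database backend this ends up in SqlAlchemyStore.log_batch.
            mlflow.log_metrics({"loss": loss}, step=epoch * len(train_loader) + i)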
What component(s), interfaces, languages, and integrations does this feature affect?
Components
- area/docs: MLflow documentation pages
- area/artifacts: Artifact stores and artifact logging
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/build: Build and test infrastructure for MLflow
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging
Interfaces
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Languages
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- Created: 2 years ago
- Reactions: 4
- Comments: 6 (3 by maintainers)
Does that mean using something like ‘bulk_insert_mappings’ or ‘add_all’ in one DB transaction? By doing it that way I can speed things up by maybe 10 times.
@simonhessner Currently, log_batch opens a single session and performs multiple writes within it. A more efficient strategy would be to write everything in a single batch instead of iterating over each MLflow entity inside that one DB session.
cc: @harupy @WeichenXu123 for commentary.
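As a rough sketch of the bulk-write idea mentioned above, metrics could be inserted with a single bulk statement rather than one INSERT per metric, using SQLAlchemy's bulk_insert_mappings. SqlMetric is assumed to be the model backing the metrics table, and its column names may differ between MLflow versions; note that bulk_insert_mappings bypasses the ORM unit of work, so any deduplication or metric-history bookkeeping would have to happen before the insert.

with self.ManagedSessionMaker() as session:
    # One bulk INSERT for all metrics of the run instead of a statement per metric.
    session.bulk_insert_mappings(
        SqlMetric,
        [
            {
                "run_uuid": run_id,
                "key": m.key,
                "value": m.value,
                "timestamp": m.timestamp,
                "step": m.step,
            }
            for m in metrics
        ],
    )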