
[FR] SqlAlchemyStore.log_batch should really write in batches

See original GitHub issue

MLflow Roadmap Item

This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers. We’re seeking help with the implementation of roadmap items tagged with the help wanted label.

For requirements clarifications and implementation questions, or to request a PR review, please tag @WeichenXu123 in your communications related to this issue.

Proposal Summary

Since MLflow can easily become slow, it would be great if the log_batch method actually logged in batches. Currently the method only iterates over all metrics, params, and tags and calls log_metric/log_param/set_tag on each one. As I understand it, each of these calls opens a new database session, which is very slow. It seriously slows down my training loop when I log the losses for every batch of every epoch.

Motivation

See the following snippet from sqlalchemy_store.py:

def log_batch(self, run_id, metrics, params, tags):
    _validate_run_id(run_id)
    _validate_batch_log_data(metrics, params, tags)
    _validate_batch_log_limits(metrics, params, tags)
    with self.ManagedSessionMaker() as session:
        run = self._get_run(run_uuid=run_id, session=session)
        self._check_run_is_active(run)
    try:
        for param in params:
            self.log_param(run_id, param)
        for metric in metrics:
            self.log_metric(run_id, metric)
        for tag in tags:
            self.set_tag(run_id, tag)
    except MlflowException as e:
        raise e
    except Exception as e:
        raise MlflowException(e, INTERNAL_ERROR)
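As an illustration of the proposal, here is a minimal, self-contained sketch (not MLflow's actual code) of what a batched write could look like: all rows are collected and inserted through a single session and a single commit, instead of one session per log_metric call. The SqlMetric model, table, and function below are simplified stand-ins invented for this example.

```python
# Hypothetical sketch of a batched log_batch, assuming a simplified metrics
# table. This is NOT MLflow's implementation, just the single-transaction idea.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class SqlMetric(Base):
    # Simplified stand-in for MLflow's real SqlMetric model.
    __tablename__ = "metrics"
    id = Column(Integer, primary_key=True)
    run_id = Column(String)
    key = Column(String)
    value = Column(Float)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

def log_batch_single_transaction(run_id, metrics):
    """Write all metrics in one session/commit instead of one per metric."""
    with Session() as session:
        session.add_all(
            [SqlMetric(run_id=run_id, key=k, value=v) for k, v in metrics]
        )
        session.commit()

log_batch_single_transaction("run1", [("loss", 0.5), ("accuracy", 0.9)])
```

The key difference from the quoted snippet is that the loop happens inside one `with Session() as session:` block, so the database sees a single transaction rather than one round trip per entity.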
  • What is the use case for this feature? Imagine you are doing a hyperparameter search (e.g. using hyperopt or optuna) and want to log each trial in detail. In my example I am training 20 models in parallel on 4 GPUs and want to log the losses. With the default file storage (the mlruns folder) this works very well, but the UI becomes extremely slow. I am therefore switching to a database backend, which makes the UI really fast, but unfortunately makes logging very slow.

  • Why is this use case valuable to support for MLflow users in general? Performance is always important, and I have already seen many questions about performance issues in MLflow.

  • Why is this use case valuable to support for your project(s) or organization? Same as above.

  • Why is it currently difficult to achieve this use case? (please be as specific as possible about why related MLflow features and components are insufficient) Logging slows down my training loop considerably. Logging should be highly optimized, and the number of DB transactions should be kept as low as possible.

What component(s), interfaces, languages, and integrations does this feature affect?

Components

  • area/docs: MLflow documentation pages
  • area/artifacts: Artifact stores and artifact logging
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/build: Build and test infrastructure for MLflow
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interfaces

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Languages

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
kevin19930919 commented, Nov 19, 2021

Does that mean using something like bulk_insert_mappings or add_all in one DB transaction? By doing it this way, I can speed things up maybe 10 times.
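For reference, the two SQLAlchemy mechanisms the comment mentions both funnel many rows through one transaction; bulk_insert_mappings additionally skips ORM object construction and works directly from dicts. The sketch below is illustrative only, using a toy table invented for this example rather than MLflow's schema.

```python
# Illustrative use of Session.bulk_insert_mappings: many rows, one transaction,
# no per-row ORM instances. The Metric table here is a toy stand-in.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Metric(Base):
    __tablename__ = "metrics"
    id = Column(Integer, primary_key=True)
    key = Column(String)
    value = Column(Float)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# 1000 rows inserted in a single commit instead of 1000 separate transactions.
rows = [{"key": f"loss_step_{i}", "value": float(i)} for i in range(1000)]
session.bulk_insert_mappings(Metric, rows)
session.commit()
```

The roughly 10x speedup reported above is plausible for this pattern, since the per-transaction overhead (connection checkout, commit, fsync) is paid once instead of once per row.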

1 reaction
dmatrix commented, May 3, 2021

@simonhessner Currently, in log_batch we open a single session and perform multiple writes. A more efficient strategy would be to write everything in a single batch, as opposed to iterating over each MLflow entity within a single DB session.

cc: @harupy @WeichenXu123 for commentary.

Read more comments on GitHub >

Top Results From Across the Web

How can I commit batches of entries to an SQL database with ...
Committing a large batch of entries together is much more efficient, but if there is an error in one of the entries, e.g....
Read more >
How to Perform Bulk Inserts With SQLAlchemy Efficiently in ...
It's very convenient to use SQLAlchemy to interact with relational ... In this post, we will introduce different ways for bulk inserts and ......
Read more >
sqlalchemy_batch_inserts: a module for when you're inserting ...
We realized that SQLAlchemy only batches inserts if the primary keys are already defined on your model. In our example, SQLAlchemy needs to...
Read more >
Performance — SQLAlchemy 1.4 Documentation
Here, we want to use the technique described at engine logging, looking for statements with the [no key] indicator or even [dialect does...
Read more >
Bulk Updates and Deletes in Flask-SQLAlchemy - YouTube
Instead of having to update or delete each record one by one, bulk queries can be done all at once. Need one-on-one help...
Read more >
