[FR] SqlAlchemyStore.log_batch should really write in batches
MLflow Roadmap Item
This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers. We're seeking help with the implementation of roadmap items tagged with the "help wanted" label.
For requirements clarifications and implementation questions, or to request a PR review, please tag @WeichenXu123 in your communications related to this issue.
Proposal Summary
Since logging to MLflow can easily become very slow, it would be great if the log_batch method actually wrote in batches. Currently the method only iterates over all metrics, params, and tags and calls log_metric/log_param/set_tag on each one. To my understanding, each of these calls issues its own database round trip, which is very slow. It seriously slows down my training loop when I log the losses in each batch of each epoch.
Motivation
See the following snippet from sqlalchemy_store.py:
def log_batch(self, run_id, metrics, params, tags):
    _validate_run_id(run_id)
    _validate_batch_log_data(metrics, params, tags)
    _validate_batch_log_limits(metrics, params, tags)
    with self.ManagedSessionMaker() as session:
        run = self._get_run(run_uuid=run_id, session=session)
        self._check_run_is_active(run)
        try:
            for param in params:
                self.log_param(run_id, param)
            for metric in metrics:
                self.log_metric(run_id, metric)
            for tag in tags:
                self.set_tag(run_id, tag)
        except MlflowException as e:
            raise e
        except Exception as e:
            raise MlflowException(e, INTERNAL_ERROR)
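For illustration, here is a minimal sketch of what a batched implementation could look like: it builds all ORM objects up front and stages them in the already-open session with a single add_all call, so the INSERTs are flushed together in one transaction. The SqlParam, SqlMetric, and SqlTag classes and their columns are assumptions based on MLflow's tracking schema and may differ between versions, and the per-entity validation and duplicate handling performed inside log_param, log_metric, and set_tag is omitted here.

def log_batch(self, run_id, metrics, params, tags):
    _validate_run_id(run_id)
    _validate_batch_log_data(metrics, params, tags)
    _validate_batch_log_limits(metrics, params, tags)
    with self.ManagedSessionMaker() as session:
        run = self._get_run(run_uuid=run_id, session=session)
        self._check_run_is_active(run)
        try:
            # Build all ORM objects first, then hand them to the session in one
            # call so SQLAlchemy flushes the INSERTs together instead of issuing
            # one round trip per logged entity.
            session.add_all(
                [SqlParam(run_uuid=run_id, key=p.key, value=p.value) for p in params]
                + [
                    SqlMetric(
                        run_uuid=run_id,
                        key=m.key,
                        value=m.value,
                        timestamp=m.timestamp,
                        step=m.step,
                    )
                    for m in metrics
                ]
                + [SqlTag(run_uuid=run_id, key=t.key, value=t.value) for t in tags]
            )
        except MlflowException as e:
            raise e
        except Exception as e:
            raise MlflowException(e, INTERNAL_ERROR)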
- What is the use case for this feature? Imagine you are doing a hyperparameter search (e.g. using hyperopt or optuna) and you want to log each of the trials in detail. In my example I am training 20 models in parallel on 4 GPUs and I want to log the losses. When using the default file storage (mlruns folder), logging works very well, but the UI gets extremely slow. Therefore I am switching to a database backend, which makes the UI really fast but unfortunately makes the logging very slow. (A sketch of such a training loop follows this list.)
- Why is this use case valuable to support for MLflow users in general? Performance is always important, and I have already seen many questions about performance issues in MLflow.
- Why is this use case valuable to support for your project(s) or organization? Same as above.
- Why is it currently difficult to achieve this use case? (please be as specific as possible about why related MLflow features and components are insufficient) Logging slows down my training loop a lot. Logging should be highly optimized, and the number of DB transactions should be as low as possible.
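For concreteness, a hedged sketch of the use case above. The tracking URI, num_epochs, train_loader, and train_step are placeholders; each mlflow.log_metrics call is routed through the tracking client to the store's log_batch, which is why per-entity round trips there dominate a tight training loop.

import mlflow

mlflow.set_tracking_uri("postgresql://user:password@localhost/mlflow")  # placeholder URI

with mlflow.start_run():
    for epoch in range(num_epochs):                  # num_epochs: placeholder
        for i, batch in enumerate(train_loader):     # train_loader: placeholder
            loss = train_step(batch)                 # train_step: placeholder
            # On a database backend this ends up in SqlAlchemyStore.log_batch.
            mlflow.log_metrics({"loss": loss}, step=epoch * len(train_loader) + i)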
What component(s), interfaces, languages, and integrations does this feature affect?
Components
- area/docs: MLflow documentation pages
- area/artifacts: Artifact stores and artifact logging
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/build: Build and test infrastructure for MLflow
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: Local serving, model deployment tools, spark UDFs
- area/server-infra: MLflow server, JavaScript dev server
- area/tracking: Tracking Service, tracking client APIs, autologging
Interfaces
- area/uiux: Front-end, user experience, JavaScript, plotting
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Languages
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- Created: 2 years ago
- Reactions: 4
- Comments: 6 (3 by maintainers)
Does that mean using something like ‘bulk_insert_mappings’ or ‘add_all’ in one DB transaction? By doing it that way I can speed things up by maybe 10 times.
@simonhessner Currently, log_batch opens a single session and performs multiple writes within it. A more efficient strategy would be to write everything in a single batch instead of iterating over each MLflow entity inside that one DB session.
cc: @harupy @WeichenXu123 for commentary.
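As a rough sketch of the bulk-write idea mentioned above, metrics could be inserted with a single bulk statement rather than one INSERT per metric, using SQLAlchemy's bulk_insert_mappings. SqlMetric is assumed to be the model backing the metrics table, and its column names may differ between MLflow versions; note that bulk_insert_mappings bypasses the ORM unit of work, so any deduplication or metric-history bookkeeping would have to happen before the insert.

with self.ManagedSessionMaker() as session:
    # One bulk INSERT for all metrics of the run instead of a statement per metric.
    session.bulk_insert_mappings(
        SqlMetric,
        [
            {
                "run_uuid": run_id,
                "key": m.key,
                "value": m.value,
                "timestamp": m.timestamp,
                "step": m.step,
            }
            for m in metrics
        ],
    )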