[BUG] Service metrics endpoint excludes many important routes
Willingness to contribute
The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
- No. I cannot contribute a bug fix at this time.
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 11
- MLflow installed from (source or binary): binary
- MLflow version (run mlflow --version): 1.21.0
- Python version: 3.8
- npm version, if running the dev UI:
- Exact command to reproduce: see section below
Describe the problem
The mlflow server option to expose a Prometheus metrics endpoint is a great observability feature for MLflow. Unfortunately, the current implementation provides an incomplete view of server health and performance: MLflow only records service metrics for a subset of its routes.
As of MLflow 1.21, the following routes are not included in the service metrics:
['static', '_get_experiment_by_name', '_create_experiment', '_list_experiments', '_get_experiment',
'_delete_experiment', '_restore_experiment', '_update_experiment', '_update_run', '_delete_run',
'_restore_run', '_set_experiment_tag', '_delete_tag', '_get_run', '_list_artifacts', '_get_metric_history',
'_log_batch', '_log_model', '_create_registered_model', '_rename_registered_model',
'_update_registered_model', '_delete_registered_model', '_get_registered_model', '_search_registered_models',
'_list_registered_models', '_get_latest_versions', '_create_model_version', '_update_model_version',
'_transition_stage', '_delete_model_version', '_get_model_version', '_search_model_versions',
'_get_model_version_download_uri', '_set_registered_model_tag', '_set_model_version_tag',
'_delete_registered_model_tag', '_delete_model_version_tag', 'health', 'serve_artifacts',
'serve_model_version_artifact', 'serve_static_file', 'serve']
The full set of registered endpoints can be listed with:

from mlflow.server import app
print(sorted(app.view_functions.keys()))
Filtering the set of routes to be included in the metrics endpoint seems like a potentially fragile approach as new routes are added in later versions of mlflow. It’s especially problematic that the list of filtered routes cannot be configured. We currently have no way to monitor the health of the overall service given that many key routes (e.g. log_batch) are not included in the service metrics.
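Filtering could be avoided entirely by instrumenting every registered route. Below is a minimal sketch (not MLflow’s actual implementation) of how prometheus_flask_exporter can export default request metrics for all Flask routes, so newly added routes are picked up automatically; the toy app and /health route are placeholders for illustration:

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)

# export_defaults=True attaches request count and latency metrics to every
# registered route; group_by="endpoint" labels them by Flask endpoint name,
# so no hard-coded list of routes is required.
metrics = PrometheusMetrics(app, export_defaults=True, group_by="endpoint")

@app.route("/health")
def health():
    return "OK"

if __name__ == "__main__":
    # The exporter serves its metrics at /metrics by default.
    app.run(port=8000)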
Code to reproduce issue
Dockerfile for mlflow server
FROM python:3.8
RUN pip install mlflow==1.21.0
ENTRYPOINT mlflow server \
--backend-store-uri sqlite:///mlflow.sqlite \
--default-artifact-root file:///artifacts \
--host 0.0.0.0 \
--port 5000 \
--expose-prometheus /prometheus
Build and run the Docker container
docker build -t mlflow_example -f Dockerfile .
docker run -p 5000:5000 mlflow_example
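Once the container is running, the metrics endpoint can be sanity-checked from Python (a quick illustrative check, assuming the requests package is installed):

import requests

# Fetch the Prometheus exposition page served by the MLflow server.
resp = requests.get("http://127.0.0.1:5000/metrics")
resp.raise_for_status()

# Print the first few lines to confirm the endpoint is live.
print("\n".join(resp.text.splitlines()[:10]))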
Script with incomplete representation in metrics endpoint
import mlflow
import random

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("service_metrics")

with mlflow.start_run(run_name="test"):
    for _ in range(100):
        mlflow.log_metrics({
            'loss_a': random.random(),
            'loss_b': random.random(),
            'loss_c': random.random(),
        })
    mlflow.log_params({'a': 1, 'b': 2, 'c': 3})
Note that metrics for these calls (which are routed through the excluded log_batch endpoint) do not appear at http://127.0.0.1:5000/metrics.
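To see exactly which handlers are represented, the metrics page can be fetched and filtered (an illustrative helper, assuming the requests package is installed; rerunning it after the script below shows the difference):

import requests

# Keep only non-comment exposition lines that mention log-related handlers.
text = requests.get("http://127.0.0.1:5000/metrics").text
for line in text.splitlines():
    if "log" in line and not line.startswith("#"):
        print(line)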
Script with expected representation in metrics endpoint
import mlflow
import random

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("service_metrics")

with mlflow.start_run(run_name="test"):
    for _ in range(100):
        mlflow.log_metric('loss', random.random())
    mlflow.log_param('param', 'test')
Note that metrics for these calls do appear at http://127.0.0.1:5000/metrics.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
What component(s), interfaces, languages, and integrations does this bug affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
Hi @jeremyjordan, your proposal sounds good! We’d be happy to review a PR for this.
Sounds great @dbczumar! I opened a PR for this issue this morning.