
[BUG] Service metrics endpoint excludes many important routes


Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 11
  • MLflow installed from (source or binary): binary
  • MLflow version (run mlflow --version): 1.21.0
  • Python version: 3.8
  • npm version, if running the dev UI:
  • Exact command to reproduce: see section below

Describe the problem

The mlflow server option to expose a Prometheus metrics endpoint is a great observability feature for MLflow. Unfortunately, the current implementation gives an incomplete view of server health and performance: mlflow only records service metrics for a subset of its endpoints.

As of MLflow version 1.21, the following routes are not being included in the service metrics:

['static', '_get_experiment_by_name', '_create_experiment', '_list_experiments', '_get_experiment',
'_delete_experiment', '_restore_experiment', '_update_experiment',  '_update_run', '_delete_run', 
'_restore_run', '_set_experiment_tag',  '_delete_tag', '_get_run',  '_list_artifacts', '_get_metric_history',
'_log_batch', '_log_model', '_create_registered_model', '_rename_registered_model', 
'_update_registered_model',  '_delete_registered_model', '_get_registered_model', '_search_registered_models', 
'_list_registered_models',  '_get_latest_versions', '_create_model_version', '_update_model_version', 
'_transition_stage', '_delete_model_version',  '_get_model_version', '_search_model_versions', 
'_get_model_version_download_uri', '_set_registered_model_tag', '_set_model_version_tag', 
'_delete_registered_model_tag', '_delete_model_version_tag', 'health', 'serve_artifacts', 
'serve_model_version_artifact', 'serve_static_file', 'serve']

(The full set of registered endpoints can be enumerated with:)

from mlflow.server import app

print(app.view_functions.keys())

Filtering the set of routes included in the metrics endpoint against a hard-coded list seems fragile, since routes added in later versions of mlflow will silently be excluded. It’s especially problematic that the list of filtered routes cannot be configured: because many key routes (e.g. log_batch) are missing from the service metrics, we currently have no way to monitor the health of the service as a whole.
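One possible direction (a sketch only — the environment variable and helper names below are hypothetical, not part of MLflow) is to invert the logic: instrument every route by default and make the exclusion list operator-configurable, so newly added routes are covered automatically.

```python
import os

# Hypothetical default exclusions; routes that add noise rather than signal.
DEFAULT_EXCLUDED = {"static", "health", "serve_static_file"}

def excluded_endpoints(env_value=None):
    """Return the set of endpoint names to exclude from service metrics.

    A comma-separated override (e.g. "static,health") replaces the default
    set; otherwise the hypothetical MLFLOW_METRICS_EXCLUDED_ROUTES
    environment variable is consulted, falling back to DEFAULT_EXCLUDED.
    """
    if env_value is None:
        env_value = os.environ.get("MLFLOW_METRICS_EXCLUDED_ROUTES", "")
    if env_value.strip():
        return {name.strip() for name in env_value.split(",") if name.strip()}
    return set(DEFAULT_EXCLUDED)

def should_instrument(endpoint, excluded):
    # Instrument everything except the explicitly excluded endpoints.
    return endpoint not in excluded
```

With this shape, a route like `_log_batch` is instrumented unless an operator opts it out, which avoids the silent-gap problem described above.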

Code to reproduce issue

Dockerfile for mlflow server

FROM python:3.8
RUN pip install mlflow==1.21.0

ENTRYPOINT mlflow server \
    --backend-store-uri sqlite:///mlflow.sqlite \
    --default-artifact-root file:///artifacts \
    --host 0.0.0.0 \
    --port 5000 \
    --expose-prometheus /prometheus

Build and run the Docker container

docker build -t mlflow_example -f Dockerfile .
docker run -p 5000:5000 mlflow_example
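If Prometheus itself will scrape this server, a minimal scrape config for the setup above might look like the following (the job name is illustrative):

```yaml
scrape_configs:
  - job_name: mlflow          # illustrative job name
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:5000"]
```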

Script with incomplete representation in metrics endpoint

import mlflow
import random

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("service_metrics")

with mlflow.start_run(run_name="test"):

    for _ in range(100):
        mlflow.log_metrics({
            'loss_a': random.random(),
            'loss_b': random.random(),
            'loss_c': random.random(),
        })

    mlflow.log_params({'a': 1, 'b': 2, 'c': 3})

Note that metrics for these requests do not appear at http://127.0.0.1:5000/metrics


Script with expected representation in metrics endpoint

import mlflow
import random

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("service_metrics")

with mlflow.start_run(run_name="test"):
    for _ in range(100):
        mlflow.log_metric('loss', random.random())

    mlflow.log_param('param', 'test')

Note that metrics for these requests do appear at http://127.0.0.1:5000/metrics

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction

dbczumar commented, Nov 9, 2021

Hi @jeremyjordan , your proposal sounds good! We’d be happy to review a PR for this.

0 reactions

jeremyjordan commented, Nov 29, 2021

Sounds great @dbczumar! I opened a PR for this issue this morning.
