
[FR] Enable support for mlflow evaluate() on Keras multiclass models

See original GitHub issue

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 1.30.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.5 LTS
  • Python version: 3.8.10

Describe the problem

The metrics hardcoded in the default evaluator prevent mlflow.evaluate() from being used on a tf.keras multiclass model, even though the same call works fine for sklearn multiclass models, xgboost, and so on.

It seems to come down to the fact that scikit-learn’s accuracy_score rejects raw continuous multi-output predictions, i.e. this doesn’t work:

from sklearn.metrics import accuracy_score

y_pred = [[0.5, 1], [-1, 1], [7, -6]]
y_true = [[0, 2], [-1, 2], [8, -5]]
accuracy_score(y_true, y_pred)  # raises ValueError: can't handle a mix of target types

whereas accuracy_score is meant to be called like this in the multilabel case (binary indicator matrices):

import numpy as np

accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
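
In this issue the mismatch is simpler: the Keras model returns softmax probabilities (continuous multi-output), while the targets column holds integer class labels (multiclass). A minimal sketch with made-up values, showing the failure and the usual argmax fix:

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical softmax output for 3 samples and 10 classes, plus integer labels
y_proba = np.random.rand(3, 10)
y_proba = y_proba / y_proba.sum(axis=1, keepdims=True)
y_true = np.array([9, 2, 1])

# accuracy_score(y_true, y_proba) would raise:
#   ValueError: Classification metrics can't handle a mix of multiclass
#   and continuous-multioutput targets

# Collapsing the probabilities to predicted class labels works
accuracy_score(y_true, y_proba.argmax(axis=1))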

I suppose one way to work around this would be to write my own evaluator. However, it feels like it could also be solved if we could provide our own metrics, e.g. through evaluator_config: Dict.
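
A rough sketch of such a workaround (untested, and the wrapper name is mine, not an official MLflow recipe): log a pyfunc model that applies argmax to the Keras output, so the default evaluator only ever sees integer class labels:

import mlflow
import mlflow.pyfunc
import numpy as np

class ArgmaxWrapper(mlflow.pyfunc.PythonModel):
    """Untested sketch: turn softmax probabilities into predicted class labels."""

    def load_context(self, context):
        import tensorflow as tf
        self.keras_model = tf.keras.models.load_model(context.artifacts["keras_model"])

    def predict(self, context, model_input):
        proba = self.keras_model.predict(model_input)
        return np.argmax(proba, axis=1)

# model_func.save("keras_model")  # save the trained Keras model locally first
# model_info = mlflow.pyfunc.log_model(
#     "model",
#     python_model=ArgmaxWrapper(),
#     artifacts={"keras_model": "keras_model"},
# )
# mlflow.evaluate(model_info.model_uri, df, targets="target", model_type="classifier")

The obvious downside is that probability-based metrics such as log_loss or roc_auc would be lost, which is why a first-class option or pluggable metrics would still be preferable.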

Tracking information

No response

Code to reproduce issue

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mlflow
from tensorflow.keras.utils import to_categorical
from keras.datasets import fashion_mnist
from keras.models import Sequential, Model
from keras.layers import Dense, Input

# Import fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Display the first 7 images
fig, axes = plt.subplots(ncols=7, sharex=False,
                         sharey=True, figsize=(16, 4))
for i in range(7):
    axes[i].set_title(y_train[i])
    axes[i].imshow(X_train[i], cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
plt.show()

print("Original shape of X_train =", X_train.shape)
print("Original shape of X_test =", X_test.shape, end='\n')

# Reshape X_train to (60000, 784) and X_test to (10000, 784)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])

print("New X_train shape", X_train.shape)
print("New X_test shape", X_test.shape, end='\n')

# Convert target (y_train and y_test) into one-hot
temp = []
for i in range(len(y_train)):
    temp.append(to_categorical(y_train[i], num_classes=10))
    
y_train = np.array(temp)

temp = []
for i in range(len(y_test)):
    temp.append(to_categorical(y_test[i], num_classes=10))

y_test = np.array(temp)

# Create and train sequential model
model_seq = Sequential()
model_seq.add(Dense(5, activation='sigmoid', input_shape=(X_train.shape[1],)))
model_seq.add(Dense(4, activation='sigmoid'))
model_seq.add(Dense(10, activation='softmax'))

model_seq.summary()

model_seq.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['acc'])

model_seq.fit(X_train, y_train, epochs=3, 
              validation_data=(X_test,y_test))

# Create and train functional model
input1 = Input(shape=(X_train.shape[1],))
hidden1 = Dense(5, activation='sigmoid')(input1)
hidden2 = Dense(4, activation='sigmoid')(hidden1)
output = Dense(10, activation='softmax')(hidden2)
model_func = Model(inputs=input1, outputs=output)

model_func.summary()

model_func.compile(loss='categorical_crossentropy', 
                   optimizer='adam', 
                   metrics=['acc'])

model_func.fit(X_train, y_train, epochs=3, 
               validation_data=(X_test,y_test))
# Create evaluation DataFrame: recover integer class labels from the one-hot y_test
# (equivalently: targets = y_test.argmax(axis=1))
targets = np.where(y_test == np.amax(y_test))[1]

df = pd.DataFrame(X_test)
df['target'] = targets


# MLflow evaluation
with mlflow.start_run() as run:
    model_info = mlflow.sklearn.log_model(model_func, "model")
    result = mlflow.evaluate(
       model_info.model_uri,
       df,
       targets="target",
       model_type="classifier",
       dataset_name="adult",
       evaluators=["default"],
    )

Stack trace

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-1847767594277258> in <module>
     83 with mlflow.start_run() as run:
     84     model_info = mlflow.sklearn.log_model(model_func, "model")
---> 85     result = mlflow.evaluate(
     86        model_info.model_uri,
     87        df,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in evaluate(model, data, targets, model_type, dataset_name, dataset_path, feature_names, evaluators, evaluator_config, custom_metrics, validation_thresholds, baseline_model, env_manager)
   1241     with _start_run_or_reuse_active_run() as run_id:
   1242         try:
-> 1243             evaluate_result = _evaluate(
   1244                 model=model,
   1245                 model_type=model_type,

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_evaluation.py in patched_evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, **kwargs)
     38             try:
     39                 original_succeeded = False
---> 40                 original_result = original_evaluate_fn(
     41                     model=model,
     42                     model_type=model_type,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in _evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, custom_metrics, baseline_model)
    814         if evaluator.can_evaluate(model_type=model_type, evaluator_config=config):
    815             _logger.info(f"Evaluating the model with the {evaluator_name} evaluator.")
--> 816             eval_result = evaluator.evaluate(
    817                 model=model,
    818                 model_type=model_type,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in evaluate(self, model, model_type, dataset, run_id, evaluator_config, custom_metrics, baseline_model, **kwargs)
   1226             if baseline_model:
   1227                 _logger.info("Evaluating candidate model:")
-> 1228             evaluation_result = self._evaluate(model, is_baseline_model=False)
   1229 
   1230         if not baseline_model:

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _evaluate(self, model, is_baseline_model, **kwargs)
   1175             with mlflow.utils.autologging_utils.disable_autologging():
   1176                 self._generate_model_predictions()
-> 1177                 self._compute_builtin_metrics()
   1178                 self._evaluate_custom_metrics_and_log_produced_artifacts(
   1179                     log_to_mlflow_tracking=not is_baseline_model

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _compute_builtin_metrics(self)
   1114                 average = self.evaluator_config.get("average", "weighted")
   1115                 self.metrics.update(
-> 1116                     _get_multiclass_classifier_metrics(
   1117                         y_true=self.y,
   1118                         y_pred=self.y_pred,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_multiclass_classifier_metrics(y_true, y_pred, y_proba, labels, average, sample_weights)
    225     sample_weights=None,
    226 ):
--> 227     return _get_common_classifier_metrics(
    228         y_true=y_true,
    229         y_pred=y_pred,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_common_classifier_metrics(y_true, y_pred, y_proba, labels, average, pos_label, sample_weights)
    164     metrics = {
    165         "example_count": len(y_true),
--> 166         "accuracy_score": sk_metrics.accuracy_score(y_true, y_pred, sample_weight=sample_weights),
    167         "recall_score": sk_metrics.recall_score(
    168             y_true,

/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    200 
    201     # Compute accuracy for each possible representation
--> 202     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    203     check_consistent_length(y_true, y_pred, sample_weight)
    204     if y_type.startswith('multilabel'):

/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     90 
     91     if len(y_type) > 1:
---> 92         raise ValueError("Classification metrics can't handle a mix of {0} "
     93                          "and {1} targets".format(type_true, type_pred))
     94 

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
mlflow-automation commented, Nov 19, 2022

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

1 reaction
harupy commented, Nov 15, 2022

@paprocki-r

We’re considering adding a boolean flag indicating that the model outputs probabilities:

mlflow.evaluate(..., evaluator_config={"output_probabilities": True})

Please let us know your thoughts on this.

Read more comments on GitHub >
