
[FR] Enable support for mlflow evaluate() on Keras multiclass models

See original GitHub issue

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

  • Client: 1.30.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.5 LTS
  • Python version: 3.8.10

Describe the problem

The metrics hardcoded in the default evaluator prevent mlflow.evaluate() from being used on a tf.keras multiclass model, even though the same call works fine for sklearn multiclass models, xgboost, and so on.

It seems to come down to the fact that scikit-learn’s accuracy_score rejects raw continuous multi-output predictions, i.e. this doesn’t work:

from sklearn.metrics import accuracy_score

y_pred = [[0.5, 1], [-1, 1], [7, -6]]
y_true = [[0, 2], [-1, 2], [8, -5]]
accuracy_score(y_true, y_pred)  # raises ValueError: can't handle a mix of target types

whereas accuracy_score is meant to be called like this in the multilabel case (binary indicator matrices):

import numpy as np

accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
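
In this issue the mismatch is simpler: the Keras model returns softmax probabilities (continuous multi-output), while the targets column holds integer class labels (multiclass). A minimal sketch with made-up values, showing the failure and the usual argmax fix:

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical softmax output for 3 samples and 10 classes, plus integer labels
y_proba = np.random.rand(3, 10)
y_proba = y_proba / y_proba.sum(axis=1, keepdims=True)
y_true = np.array([9, 2, 1])

# accuracy_score(y_true, y_proba) would raise:
#   ValueError: Classification metrics can't handle a mix of multiclass
#   and continuous-multioutput targets

# Collapsing the probabilities to predicted class labels works
accuracy_score(y_true, y_proba.argmax(axis=1))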

I suppose one way to work around this would be to write my own evaluator. However, it feels like it could also be solved if we could provide our own metrics, e.g. through evaluator_config: Dict.
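
A rough sketch of such a workaround (untested, and the wrapper name is mine, not an official MLflow recipe): log a pyfunc model that applies argmax to the Keras output, so the default evaluator only ever sees integer class labels:

import mlflow
import mlflow.pyfunc
import numpy as np

class ArgmaxWrapper(mlflow.pyfunc.PythonModel):
    """Untested sketch: turn softmax probabilities into predicted class labels."""

    def load_context(self, context):
        import tensorflow as tf
        self.keras_model = tf.keras.models.load_model(context.artifacts["keras_model"])

    def predict(self, context, model_input):
        proba = self.keras_model.predict(model_input)
        return np.argmax(proba, axis=1)

# model_func.save("keras_model")  # save the trained Keras model locally first
# model_info = mlflow.pyfunc.log_model(
#     "model",
#     python_model=ArgmaxWrapper(),
#     artifacts={"keras_model": "keras_model"},
# )
# mlflow.evaluate(model_info.model_uri, df, targets="target", model_type="classifier")

The obvious downside is that probability-based metrics such as log_loss or roc_auc would be lost, which is why a first-class option or pluggable metrics would still be preferable.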

Tracking information

No response

Code to reproduce issue

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mlflow
from tensorflow.keras.utils import to_categorical
from keras.datasets import fashion_mnist
from keras.models import Sequential, Model
from keras.layers import Dense, Input

# Import fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Display the first 7 images
fig, axes = plt.subplots(ncols=7, sharex=False,
                         sharey=True, figsize=(16, 4))
for i in range(7):
    axes[i].set_title(y_train[i])
    axes[i].imshow(X_train[i], cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
plt.show()

print("Original shape of X_train =", X_train.shape)
print("Original shape of X_test =", X_test.shape, end='\n')

# Reshape X_train to (60000, 784) and X_test to (10000, 784)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])

print("New X_train shape", X_train.shape)
print("New X_test shape", X_test.shape, end='\n')

# Convert target (y_train and y_test) into one-hot
temp = []
for i in range(len(y_train)):
    temp.append(to_categorical(y_train[i], num_classes=10))
    
y_train = np.array(temp)

temp = []
for i in range(len(y_test)):
    temp.append(to_categorical(y_test[i], num_classes=10))

y_test = np.array(temp)

# Create and train sequential model
model_seq = Sequential()
model_seq.add(Dense(5, activation='sigmoid', input_shape=(X_train.shape[1],)))
model_seq.add(Dense(4, activation='sigmoid'))
model_seq.add(Dense(10, activation='softmax'))

model_seq.summary()

model_seq.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['acc'])

model_seq.fit(X_train, y_train, epochs=3, 
              validation_data=(X_test,y_test))

# Create and train functional model
input1 = Input(shape=(X_train.shape[1],))
hidden1 = Dense(5, activation='sigmoid')(input1)
hidden2 = Dense(4, activation='sigmoid')(hidden1)
output = Dense(10, activation='softmax')(hidden2)
model_func = Model(inputs=input1, outputs=output)

model_func.summary()

model_func.compile(loss='categorical_crossentropy', 
                   optimizer='adam', 
                   metrics=['acc'])

model_func.fit(X_train, y_train, epochs=3, 
               validation_data=(X_test,y_test))
# Create evaluation DataFrame: recover integer class labels from the one-hot y_test
# (equivalently: targets = y_test.argmax(axis=1))
targets = np.where(y_test == np.amax(y_test))[1]

df = pd.DataFrame(X_test)
df['target'] = targets


# MLflow evaluation
with mlflow.start_run() as run:
    model_info = mlflow.sklearn.log_model(model_func, "model")
    result = mlflow.evaluate(
       model_info.model_uri,
       df,
       targets="target",
       model_type="classifier",
       dataset_name="adult",
       evaluators=["default"],
    )

Stack trace

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-1847767594277258> in <module>
     83 with mlflow.start_run() as run:
     84     model_info = mlflow.sklearn.log_model(model_func, "model")
---> 85     result = mlflow.evaluate(
     86        model_info.model_uri,
     87        df,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in evaluate(model, data, targets, model_type, dataset_name, dataset_path, feature_names, evaluators, evaluator_config, custom_metrics, validation_thresholds, baseline_model, env_manager)
   1241     with _start_run_or_reuse_active_run() as run_id:
   1242         try:
-> 1243             evaluate_result = _evaluate(
   1244                 model=model,
   1245                 model_type=model_type,

/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_evaluation.py in patched_evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, **kwargs)
     38             try:
     39                 original_succeeded = False
---> 40                 original_result = original_evaluate_fn(
     41                     model=model,
     42                     model_type=model_type,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in _evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, custom_metrics, baseline_model)
    814         if evaluator.can_evaluate(model_type=model_type, evaluator_config=config):
    815             _logger.info(f"Evaluating the model with the {evaluator_name} evaluator.")
--> 816             eval_result = evaluator.evaluate(
    817                 model=model,
    818                 model_type=model_type,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in evaluate(self, model, model_type, dataset, run_id, evaluator_config, custom_metrics, baseline_model, **kwargs)
   1226             if baseline_model:
   1227                 _logger.info("Evaluating candidate model:")
-> 1228             evaluation_result = self._evaluate(model, is_baseline_model=False)
   1229 
   1230         if not baseline_model:

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _evaluate(self, model, is_baseline_model, **kwargs)
   1175             with mlflow.utils.autologging_utils.disable_autologging():
   1176                 self._generate_model_predictions()
-> 1177                 self._compute_builtin_metrics()
   1178                 self._evaluate_custom_metrics_and_log_produced_artifacts(
   1179                     log_to_mlflow_tracking=not is_baseline_model

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _compute_builtin_metrics(self)
   1114                 average = self.evaluator_config.get("average", "weighted")
   1115                 self.metrics.update(
-> 1116                     _get_multiclass_classifier_metrics(
   1117                         y_true=self.y,
   1118                         y_pred=self.y_pred,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_multiclass_classifier_metrics(y_true, y_pred, y_proba, labels, average, sample_weights)
    225     sample_weights=None,
    226 ):
--> 227     return _get_common_classifier_metrics(
    228         y_true=y_true,
    229         y_pred=y_pred,

/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_common_classifier_metrics(y_true, y_pred, y_proba, labels, average, pos_label, sample_weights)
    164     metrics = {
    165         "example_count": len(y_true),
--> 166         "accuracy_score": sk_metrics.accuracy_score(y_true, y_pred, sample_weight=sample_weights),
    167         "recall_score": sk_metrics.recall_score(
    168             y_true,

/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    200 
    201     # Compute accuracy for each possible representation
--> 202     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    203     check_consistent_length(y_true, y_pred, sample_weight)
    204     if y_type.startswith('multilabel'):

/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     90 
     91     if len(y_type) > 1:
---> 92         raise ValueError("Classification metrics can't handle a mix of {0} "
     93                          "and {1} targets".format(type_true, type_pred))
     94 

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets

Other info / logs

No response

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
mlflow-automation commented, Nov 19, 2022

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

1 reaction
harupy commented, Nov 15, 2022

@paprocki-r

We’re considering adding a boolean flag indicating that the model outputs probabilities:

mlflow.evaluate(..., evaluator_config={"output_probabilities": True})

Please let us know your thoughts on this.

Read more comments on GitHub >
