[FR] Enable support for mlflow evaluate() on Keras multiclass models
Issues Policy acknowledgement
- I have read and agree to submit bug reports in accordance with the issues policy
Willingness to contribute
No. I cannot contribute a bug fix at this time.
MLflow version
- Client: 1.30.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.5 LTS
- Python version: 3.8.10
Describe the problem
Hardcoded metrics in the default evaluator prevent mlflow.evaluate() from being used on a tf.keras multiclass model, even though it works well for sklearn multiclass models, XGBoost, and so on.
The failure comes down to the fact that this does not work:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [[0.5, 1], [-1, 1], [7, -6]]
y_true = [[0, 2], [-1, 2], [8, -5]]
accuracy_score(y_true, y_pred)  # raises ValueError: classification metrics can't handle this mix of target types
and how accuracy_score expects to be called in that case, with both arguments in the same (here multilabel-indicator) representation:
accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))  # works, returns 0.5
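For context, here is a minimal illustration (the arrays below are made up and not from the issue) of the same mismatch as it arises with a Keras multiclass model, and how collapsing the probability matrix with np.argmax sidesteps it:
import numpy as np
from sklearn.metrics import accuracy_score

# Integer class labels, which is what the default evaluator passes as y_true.
y_true = np.array([0, 2, 1])
# Softmax output of a multiclass Keras model: one probability per class.
# sklearn classifies this as "continuous-multioutput" and rejects the mix.
y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6],
                    [0.2, 0.5, 0.3]])
# accuracy_score(y_true, y_proba)  # would raise the ValueError from this report

# Collapsing probabilities to predicted class labels resolves the mismatch.
y_pred = np.argmax(y_proba, axis=1)
accuracy_score(y_true, y_pred)  # 1.0 for this toy example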
I suppose the workaround would be to write my own evaluator. However, it feels like this could be solved if user-defined metrics could be supplied, e.g. through evaluator_config: Dict.
Tracking information
No response
Code to reproduce issue
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mlflow
from tensorflow.keras.utils import to_categorical
from keras.datasets import fashion_mnist
from keras.models import Sequential, Model
from keras.layers import Dense, Input

# Import fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Display the first 7 images
fig, axes = plt.subplots(ncols=7, sharex=False,
                         sharey=True, figsize=(16, 4))
for i in range(7):
    axes[i].set_title(y_train[i])
    axes[i].imshow(X_train[i], cmap='gray')
    axes[i].get_xaxis().set_visible(False)
    axes[i].get_yaxis().set_visible(False)
plt.show()

print("Original shape of X_train =", X_train.shape)
print("Original shape of X_test =", X_test.shape, end='\n')

# Reshape X_train to (60000, 784) and X_test to (10000, 784)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])
print("New X_train shape", X_train.shape)
print("New X_test shape", X_test.shape, end='\n')

# Convert target (y_train and y_test) into one-hot
temp = []
for i in range(len(y_train)):
    temp.append(to_categorical(y_train[i], num_classes=10))
y_train = np.array(temp)
temp = []
for i in range(len(y_test)):
    temp.append(to_categorical(y_test[i], num_classes=10))
y_test = np.array(temp)

# Create and train sequential model
model_seq = Sequential()
model_seq.add(Dense(5, activation='sigmoid', input_shape=(X_train.shape[1],)))
model_seq.add(Dense(4, activation='sigmoid'))
model_seq.add(Dense(10, activation='softmax'))
model_seq.summary()
model_seq.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])
model_seq.fit(X_train, y_train, epochs=3,
              validation_data=(X_test, y_test))

# Create and train functional model
input1 = Input(shape=(X_train.shape[1],))
hidden1 = Dense(5, activation='sigmoid')(input1)
hidden2 = Dense(4, activation='sigmoid')(hidden1)
output = Dense(10, activation='softmax')(hidden2)
model_func = Model(inputs=input1, outputs=output)
model_func.summary()
model_func.compile(loss='categorical_crossentropy',
                   optimizer='adam',
                   metrics=['acc'])
model_func.fit(X_train, y_train, epochs=3,
               validation_data=(X_test, y_test))

# Create evaluation DataFrame: recover integer class labels from the one-hot targets
targets = np.where(y_test == np.amax(y_test))[1]
df = pd.DataFrame(X_test)
df['target'] = targets

# MLflow evaluation
with mlflow.start_run() as run:
    model_info = mlflow.sklearn.log_model(model_func, "model")
    result = mlflow.evaluate(
        model_info.model_uri,
        df,
        targets="target",
        model_type="classifier",
        dataset_name="adult",
        evaluators=["default"],
    )
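As a possible workaround (not part of the original report), the Keras model can be wrapped in a pyfunc model whose predict() returns argmax class labels instead of per-class probabilities, so the default evaluator's built-in sklearn metrics receive multiclass predictions. This is a hedged sketch that reuses model_func and df from the snippet above; ArgmaxWrapper is a hypothetical helper, and whether the embedded Keras model serializes cleanly through the pyfunc flavor depends on the TensorFlow version:
import mlflow.pyfunc

class ArgmaxWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper: returns class labels instead of probabilities."""
    def __init__(self, keras_model):
        self.keras_model = keras_model

    def predict(self, context, model_input):
        proba = self.keras_model.predict(np.asarray(model_input))
        return np.argmax(proba, axis=1)  # collapse probabilities to class labels

with mlflow.start_run() as run:
    wrapped_info = mlflow.pyfunc.log_model(
        "wrapped_model", python_model=ArgmaxWrapper(model_func)
    )
    result = mlflow.evaluate(
        wrapped_info.model_uri,
        df,
        targets="target",
        model_type="classifier",
        evaluators=["default"],
    )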
Stack trace
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<command-1847767594277258> in <module>
83 with mlflow.start_run() as run:
84 model_info = mlflow.sklearn.log_model(model_func, "model")
---> 85 result = mlflow.evaluate(
86 model_info.model_uri,
87 df,
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in evaluate(model, data, targets, model_type, dataset_name, dataset_path, feature_names, evaluators, evaluator_config, custom_metrics, validation_thresholds, baseline_model, env_manager)
1241 with _start_run_or_reuse_active_run() as run_id:
1242 try:
-> 1243 evaluate_result = _evaluate(
1244 model=model,
1245 model_type=model_type,
/databricks/python_shell/dbruntime/MLWorkloadsInstrumentation/_evaluation.py in patched_evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, **kwargs)
38 try:
39 original_succeeded = False
---> 40 original_result = original_evaluate_fn(
41 model=model,
42 model_type=model_type,
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/base.py in _evaluate(model, model_type, dataset, run_id, evaluator_name_list, evaluator_name_to_conf_map, custom_metrics, baseline_model)
814 if evaluator.can_evaluate(model_type=model_type, evaluator_config=config):
815 _logger.info(f"Evaluating the model with the {evaluator_name} evaluator.")
--> 816 eval_result = evaluator.evaluate(
817 model=model,
818 model_type=model_type,
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in evaluate(self, model, model_type, dataset, run_id, evaluator_config, custom_metrics, baseline_model, **kwargs)
1226 if baseline_model:
1227 _logger.info("Evaluating candidate model:")
-> 1228 evaluation_result = self._evaluate(model, is_baseline_model=False)
1229
1230 if not baseline_model:
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _evaluate(self, model, is_baseline_model, **kwargs)
1175 with mlflow.utils.autologging_utils.disable_autologging():
1176 self._generate_model_predictions()
-> 1177 self._compute_builtin_metrics()
1178 self._evaluate_custom_metrics_and_log_produced_artifacts(
1179 log_to_mlflow_tracking=not is_baseline_model
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _compute_builtin_metrics(self)
1114 average = self.evaluator_config.get("average", "weighted")
1115 self.metrics.update(
-> 1116 _get_multiclass_classifier_metrics(
1117 y_true=self.y,
1118 y_pred=self.y_pred,
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_multiclass_classifier_metrics(y_true, y_pred, y_proba, labels, average, sample_weights)
225 sample_weights=None,
226 ):
--> 227 return _get_common_classifier_metrics(
228 y_true=y_true,
229 y_pred=y_pred,
/databricks/python/lib/python3.8/site-packages/mlflow/models/evaluation/default_evaluator.py in _get_common_classifier_metrics(y_true, y_pred, y_proba, labels, average, pos_label, sample_weights)
164 metrics = {
165 "example_count": len(y_true),
--> 166 "accuracy_score": sk_metrics.accuracy_score(y_true, y_pred, sample_weight=sample_weights),
167 "recall_score": sk_metrics.recall_score(
168 y_true,
/databricks/python/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
200
201 # Compute accuracy for each possible representation
--> 202 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
203 check_consistent_length(y_true, y_pred, sample_weight)
204 if y_type.startswith('multilabel'):
/databricks/python/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
90
91 if len(y_type) > 1:
---> 92 raise ValueError("Classification metrics can't handle a mix of {0} "
93 "and {1} targets".format(type_true, type_pred))
94
ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
Other info / logs
No response
What component(s) does this bug affect?
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
What language(s) does this bug affect?
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.
@paprocki-r
We’re considering adding a boolean flag indicating that the model outputs probabilities.
Please let us know your thoughts on this.
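For illustration only, a sketch of how such a flag might be surfaced through evaluator_config; the predictions_are_probabilities key is hypothetical and does not exist in the current MLflow API:
# Hypothetical only: "predictions_are_probabilities" is NOT an existing MLflow
# option; it sketches the proposed boolean flag. The default evaluator could then
# argmax the model output before computing the built-in sklearn metrics.
result = mlflow.evaluate(
    model_info.model_uri,
    df,
    targets="target",
    model_type="classifier",
    evaluators=["default"],
    evaluator_config={"default": {"predictions_are_probabilities": True}},
)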