
Using make_scorer() for a GridSearchCV scoring parameter in a clustering task

See original GitHub issue

* Workflow:

1- Consider make_scorer() below for a clustering metric:

from sklearn.metrics import homogeneity_score, make_scorer

def score_func(y_true, y_pred, **kwargs):
    return homogeneity_score(y_true, y_pred)
scorer = make_scorer(score_func)
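
For context (my note, not from the issue): a scorer produced by make_scorer is invoked as scorer(estimator, X, y_true) and calls estimator.predict(X) internally, so it only works with estimators that expose predict. A minimal check, assuming the scorer defined above:

# Quick sanity check (illustration only): KMeans has predict, so the scorer works.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(scorer(km, X, y))  # calls km.predict(X) under the hood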

2- Consider the simple method optics():

# "optics" algorithm for clustering
# ---
def optics(data, labels):
    # data: A dataframe with two columns (x, y)
    preds = None    
    base_opt = OPTICS()
    grid_search_params = {"min_samples":np.arange(10),
                          "metric":["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"],
                          "cluster_method":["xi", "dbscan"],
                          "algorithm":["auto", "ball_tree", "kd_tree", "brute"]}
    
    grid_search_cv = GridSearchCV(estimator=base_opt,
                                  param_grid=grid_search_params,
                                  scoring=scorer)
    
    grid_search_cv.fit(data)    
    opt = grid_search_cv.best_estimator_
    opt.fit(data)
    preds = opt.labels_
    
    # return clusters corresponding to (x, y) pairs according to "optics" algorithm
    return preds

Running optics() led to this error: TypeError: _score() missing 1 required positional argument: 'y_true'

Even when using grid_search_cv.fit(data, labels) instead of grid_search_cv.fit(data), another exception is raised: AttributeError: 'OPTICS' object has no attribute 'predict'


I think we cannot use make_scorer() with GridSearchCV for a clustering task.


* Proposed solution:

The fit() method of GridSearchCV should automatically handle the type of estimator passed to its constructor; for example, for a clustering estimator it would score against labels_ instead of calling predict().
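
In the meantime, one commonly suggested workaround (my addition, not part of the original issue; note the maintainer caveat in the comments below that refitting inside the scorer ignores the training fold) is to pass a plain callable as scoring instead of a make_scorer object. GridSearchCV invokes such a callable as scorer(estimator, X, y), so it can use fit_predict on the validation fold:

# Hedged workaround sketch, not from the original issue.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_classification
from sklearn.metrics import homogeneity_score
from sklearn.model_selection import GridSearchCV

def homogeneity_scorer(estimator, X, y):
    # Caveat: fit_predict re-clusters the validation fold, so the
    # training fold from the CV split is effectively ignored.
    labels = estimator.fit_predict(X)
    return homogeneity_score(y, labels)

X, y = make_classification(random_state=0)
grid_search_cv = GridSearchCV(estimator=OPTICS(),
                              param_grid={"min_samples": [5, 10, 15]},
                              scoring=homogeneity_scorer)
grid_search_cv.fit(X, y)
print(grid_search_cv.best_params_)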

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
adrinjalali commented, Jul 10, 2020

As @amueller mentioned, having the scorer call fit_predict is probably not what you want to do, since it’d be ignoring your training set. So an algorithm such as OPTICS may not be a good example for this use case.

Consider this code:

# %%
from sklearn.metrics import homogeneity_score, make_scorer

def score_func(y_true, y_pred, **kwargs):
    return homogeneity_score(y_true, y_pred)
scorer = make_scorer(score_func)

# %%
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import OPTICS
from sklearn.datasets import make_classification

X, y = make_classification()

base_opt = OPTICS()
grid_search_params = {"min_samples":np.arange(10),
                        "metric":["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"],
                        "cluster_method":["xi", "dbscan"],
                        "algorithm":["auto", "ball_tree", "kd_tree", "brute"]}

grid_search_cv = GridSearchCV(estimator=base_opt,
                                param_grid=grid_search_params,
                                scoring=scorer)

grid_search_cv.fit(X, y)

It’ll raise:

AttributeError: 'OPTICS' object has no attribute 'predict'

which is very sensible, since predict is not really defined for OPTICS. Now if you replace it with KMeans:

from sklearn.cluster import KMeans

base_opt = KMeans()
grid_search_params = {"n_clusters":np.arange(10)}

grid_search_cv = GridSearchCV(estimator=base_opt,
                                param_grid=grid_search_params,
                                scoring=scorer)

grid_search_cv.fit(X, y)

it works fine, since predict is well-defined for KMeans.

Now in case we don’t have the labels, we could have something like:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import OPTICS
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score, make_scorer
scorer = make_scorer(silhouette_score)
X, y = make_classification()

base_opt = OPTICS()
grid_search_params = {"min_samples":np.arange(10),
                        "metric":["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"],
                        "cluster_method":["xi", "dbscan"],
                        "algorithm":["auto", "ball_tree", "kd_tree", "brute"]}

grid_search_cv = GridSearchCV(estimator=base_opt,
                                param_grid=grid_search_params,
                                scoring=scorer)

grid_search_cv.fit(X)

This raises:

TypeError: _score() missing 1 required positional argument: 'y_true'

I think we should either support this case, or raise a more informative error. WDYT @amueller ?
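
A label-free variant of the callable-scorer workaround sketched earlier (an illustration on my part, not something proposed in this thread) is to ignore y and score the clustering of the validation fold internally, for example with silhouette_score:

# Hedged sketch for the unlabelled case; the fit_predict caveat above still applies.
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

def silhouette_scorer(estimator, X, y=None):
    labels = estimator.fit_predict(X)
    if len(np.unique(labels)) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    return silhouette_score(X, labels)

X, _ = make_classification(random_state=0)
grid_search_cv = GridSearchCV(estimator=OPTICS(),
                              param_grid={"min_samples": [5, 10]},
                              scoring=silhouette_scorer)
grid_search_cv.fit(X)  # no labels passed
print(grid_search_cv.best_params_)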

0 reactions
imanirajian commented, Jul 12, 2020

@amueller

But tbh I think that’s a very strange thing to do. What is the motivation of using cross-validation in this setting?

Motivation: Search the parameter space to find the best parameter choice for an optics (or dbscan) model.

…what you’d expect it to do.

What do you want to do?

Goal: Find the best parameters (w.r.t. the parameter grid grid_search_params) for a clustering estimator, with or without labels (in my case I have labels).

There is no notion of training and test set in your code

...
for k, (train_indices, test_indices) in enumerate(k_fold.split(data)):
    data_train = data.iloc[train_indices]
    data_test = data.iloc[test_indices]
...

And the way you define training and test score are confusing

Consider this:

def score_func(y_true, y_pred, **kwargs):
    return homogeneity_score(y_true, y_pred)

My custom_grid_search_cv logic (a sketch of this logic follows below):

- For each possible choice of parameters p from the parameter grid space:
  - Apply p to the estimator.
  - For i = 1…K, use the i-th fold (the current test set) of a K-fold split to fit the estimator, get the estimator's labels (predict), and compute a clustering metric to judge the model's prediction strength on the i-th fold.
  - Averaging the metric over all folds yields the score for p.
  - If the current score for p is better than the score of the previous best choice, store p as best_params.
- Apply best_params to the estimator and return that estimator.
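
A minimal sketch of the logic described above (custom_grid_search_cv is a hypothetical helper of mine, not a scikit-learn API; X and y are assumed to be NumPy arrays):

# Hedged sketch of the custom grid-search logic described above.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import homogeneity_score
from sklearn.model_selection import KFold, ParameterGrid

def custom_grid_search_cv(estimator, param_grid, X, y, n_splits=5):
    best_score, best_params = -np.inf, None
    for params in ParameterGrid(param_grid):             # each choice p
        fold_scores = []
        for _, test_idx in KFold(n_splits=n_splits).split(X):
            est = clone(estimator).set_params(**params)
            labels = est.fit_predict(X[test_idx])        # fit and label the i-th fold
            fold_scores.append(homogeneity_score(y[test_idx], labels))
        score = np.mean(fold_scores)                      # average over folds
        if best_params is None or score > best_score:
            best_score, best_params = score, params
    # apply best_params to the estimator and return it
    return clone(estimator).set_params(**best_params).fit(X), best_params

# Example usage (hypothetical):
# best_est, best_params = custom_grid_search_cv(OPTICS(), {"min_samples": [5, 10]}, X, y)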

