Using make_scorer() for a GridSearchCV scoring parameter in a clustering task
See original GitHub issue.
* Workflow:
1- Consider the make_scorer() below for a clustering metric:
from sklearn.metrics import homogeneity_score, make_scorer

def score_func(y_true, y_pred, **kwargs):
    return homogeneity_score(y_true, y_pred)

scorer = make_scorer(score_func)
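For context, a hedged illustration of how such a scorer is invoked (KMeans and make_blobs here are my own example, not part of the issue): GridSearchCV calls the scorer as scorer(estimator, X_test, y_test), and a scorer built by make_scorer relies on estimator.predict() internally.

# Illustration only; assumes synthetic make_blobs data, not from the issue.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score, make_scorer

demo_scorer = make_scorer(homogeneity_score)
X, y = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(demo_scorer(km, X, y))  # works: KMeans defines predict(), so the scorer can call it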
2- Consider the simple function optics():
# "optics" algorithm for clustering
# ---
def optics(data, labels):
# data: A dataframe with two columns (x, y)
preds = None
base_opt = OPTICS()
grid_search_params = {"min_samples":np.arange(10),
"metric":["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"],
"cluster_method":["xi", "dbscan"],
"algorithm":["auto", "ball_tree", "kd_tree", "brute"]}
grid_search_cv = GridSearchCV(estimator=base_opt,
param_grid=grid_search_params,
scoring=scorer)
grid_search_cv.fit(data)
opt = grid_search_cv.best_estimator_
opt.fit(data)
preds = opt.labels_
# return clusters corresponding to (x, y) pairs according to "optics" algorithm
return preds
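For illustration (my own example, not part of the issue), calling optics() on synthetic data reproduces the failure, since grid_search_cv.fit(data) never forwards any labels to the scorer:

# Hypothetical reproduction; assumes the optics() and scorer definitions above.
import pandas as pd
from sklearn.datasets import make_blobs

X, true_labels = make_blobs(n_samples=200, centers=3, random_state=0)
data = pd.DataFrame(X, columns=["x", "y"])
optics(data, true_labels)  # fails inside grid_search_cv.fit(data) -- see the error below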
Running optics() led to this error:
TypeError: _score() missing 1 required positional argument: 'y_true'
Even when using grid_search_cv.fit(data, labels) instead of grid_search_cv.fit(data), another exception was raised:
AttributeError: 'OPTICS' object has no attribute 'predict'
I think we cannot use make_scorer() with GridSearchCV for a clustering task.
* Proposed solution:
The fit() method of GridSearchCV should automatically handle the type of estimator passed to its constructor; for example, for a clustering estimator it should use labels_ instead of predict() for scoring.
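For comparison, a workaround sketch of my own (not proposed in the issue): GridSearchCV also accepts a plain callable with the signature scorer(estimator, X, y), which can cluster the held-out fold with fit_predict() instead of calling predict(). Note that the maintainer comments below point out the drawback of this approach: re-clustering each test fold ignores the training split.

# Hedged workaround sketch, not from the issue.
from sklearn.cluster import OPTICS
from sklearn.metrics import homogeneity_score
from sklearn.model_selection import GridSearchCV

def cv_scorer(estimator, X, y):
    # OPTICS has fit_predict() but no predict(), so re-cluster the fold.
    labels = estimator.fit_predict(X)
    return homogeneity_score(y, labels)

# Hypothetical usage, reusing grid_search_params from above:
# grid_search_cv = GridSearchCV(OPTICS(), grid_search_params, scoring=cv_scorer)
# grid_search_cv.fit(data, labels)  # labels are now forwarded to cv_scorer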
Issue Analytics
- Created 3 years ago
- Comments: 7 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
As @amueller mentioned, having the scorer call fit_predict is probably not what you want to do, since it’d be ignoring your training set. So an algorithm such as OPTICS may not be a good example for this use case. Consider this code:
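A minimal sketch of the kind of setup being described (my reconstruction with hypothetical make_blobs data, not the commenter's original snippet):

# Reconstruction sketch: OPTICS scored with a make_scorer-based scorer and labels.
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_blobs(n_samples=200, centers=3, random_state=0)
gs = GridSearchCV(OPTICS(), {"min_samples": [5, 10]},
                  scoring=make_scorer(homogeneity_score))
gs.fit(X, y)  # the default scorer tries estimator.predict(X)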
It’ll raise an AttributeError ('OPTICS' object has no attribute 'predict'), which is very sensible, since predict is not really defined for OPTICS. Now if you replace it with KMeans, it works fine, since predict is well-defined for KMeans. Now in case we don’t have the labels, we could have something like:
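Again a sketch of my own rather than the original snippet; continuing from the reconstruction above, the no-labels variant simply omits y:

# No-labels variant of the sketch above.
gs = GridSearchCV(OPTICS(), {"min_samples": [5, 10]},
                  scoring=make_scorer(homogeneity_score))
gs.fit(X)  # the scorer is then invoked without y_true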
This raises the TypeError about the missing y_true argument reported above.
I think we should either support this case, or raise a more informative error. WDYT @amueller ?
@amueller
Motivation: search the parameter space to find the best parameter choice for an OPTICS (or DBSCAN) model.
Goal: find the best parameters (w.r.t. the parameter grid grid_search_params) for a clustering estimator, with or without labels (in my case I have labels). Consider this:
My custom_grid_search_cv logic:
- For each possible choice of parameters p from the parameter grid space:
  - Apply p to the estimator.
  - For i = 1…K, use the i-th fold (the current test set) of a K-fold split to fit the estimator, then get the labels of the estimator (predict), and finally compute a clustering metric to judge the model's prediction strength on the i-th fold.
  - Averaging the metrics over all folds yields the score for p.
  - If the current p's score is better than the score of the previous choice, store the current p, say best_params.
- Apply best_params to the estimator and return that estimator.
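A minimal sketch of that logic under my own assumptions (ParameterGrid, KFold and clone from scikit-learn, homogeneity_score as the metric, and NumPy arrays for X and y):

# Hypothetical sketch of the custom grid-search logic described above.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import homogeneity_score
from sklearn.model_selection import KFold, ParameterGrid

def custom_grid_search_cv(estimator, param_grid, X, y, n_splits=5):
    # X, y are assumed to be NumPy arrays.
    best_score, best_params = -np.inf, None
    for p in ParameterGrid(param_grid):
        fold_scores = []
        for _, test_idx in KFold(n_splits=n_splits).split(X):
            est = clone(estimator).set_params(**p)
            labels = est.fit_predict(X[test_idx])  # fit on the fold, read its labels
            fold_scores.append(homogeneity_score(y[test_idx], labels))
        score = np.mean(fold_scores)
        if score > best_score:
            best_score, best_params = score, p
    # Apply the best parameters and return the refitted estimator.
    return clone(estimator).set_params(**best_params).fit(X)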