
Improve GridSearchCV and splitter classes


Edit: I have added for each bullet point the conclusions so far

Description

This is my first feature request, so I apologise in advance if I don’t follow the correct protocol. I have a modified version (not ready to be a PR, but I’m happy to contribute if agreed) of various classes in the model_selection package, addressing a number of features I wanted in GridSearchCV, which derives from BaseSearchCV. All of my changes apply to BaseSearchCV, so they would also apply to RandomizedSearchCV.

  • (will be solved in 0.21) The parallelisation is carried out via joblib’s parallel methods. This entails some problems (which I have experienced myself) that have already been reported with no clear solution; see, e.g., https://github.com/joblib/joblib/issues/125 or https://github.com/joblib/joblib/issues/480. My solution has been to use concurrent.futures.ProcessPoolExecutor.

  • [solved in PR https://github.com/scikit-learn/scikit-learn/pull/12613] The current _fit_and_score method used in BaseSearchCV doesn’t print the train score, i.e. the performance on the training set. I was interested in that, so I added that print for higher verbosity (verbose > 3); see the sketch after this list.

  • [cannot go in until Python 3.6 is the minimum supported version; the next release will still support 3.5] The current _fit_and_score method doesn’t use f-strings, like most of the methods in sklearn. I have changed that as well, but perhaps for backwards compatibility that’s something you want to keep as it is.
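
For reference, here is a minimal sketch (not the patch from the PR above) of how per-fold training scores can already be surfaced through the public API with return_train_score=True; the PR additionally prints them at high verbosity. The dataset, estimator and parameter grid are arbitrary placeholders:

```python
# Minimal sketch: recording training scores during a grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    cv=5,
    return_train_score=True,  # store per-fold training scores in cv_results_
    verbose=3,                # print per-fold progress while fitting
)
search.fit(X, y)

# Mean training score per parameter candidate:
print(search.cv_results_["mean_train_score"])
```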

In addition to that, I wanted some flexibility in how K-fold CV is done. The cv parameter in BaseSearchCV creates a KFold object if you pass an integer; otherwise you can pass a specific splitter class such as KFold, LeaveOneOut, etc. However, there are some issues:

  • (rejected: this can easily be done by creating your own splitter, and it is better not to add many more classes) When you perform a split into K folds, the train and test splits will always be K-1 folds for training and 1 for testing. I wanted something more flexible, so I created my own MarcFold (it’s really just 40 lines), where I can specify how many folds go to training and how many to testing; all possible (train, test) combinations are then generated. To the best of my knowledge, there is no implementation of that (a rough sketch of the idea is given after this list).

  • (rejected: after analysing it better, it looks like the other splitters don’t need a public _RepeatedSplits, so anyone who wants to derive from it can do so hackily based on the private class) Related to this, I had to create my own RepeatedMarcFold because there is no way to build a repeated version of MarcFold unless you call the private class _RepeatedSplits. That could otherwise be corrected by making _RepeatedSplits public, or by adding a less shy sibling class.

  • (rejected: this can be done with ShuffleSplit) I also wanted a way to specify the number of samples to be used for training, rather than the number of folds, as explained in the first bullet of this second list of bullets 😃 That parameter should always be lower than or equal to the number of samples available for training (given the first parameter and the data you have); otherwise, all of them are taken and a warning is shown.
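
To make the idea in the first bullet concrete, below is a rough, hypothetical sketch of such a splitter (this is not the author’s MarcFold code; the class name CombinationFold and its parameters are illustrative). It assigns n_train_folds of the n_splits folds to training and the remaining folds to testing, iterating over all combinations; any object exposing split and get_n_splits in this way can be passed as cv to GridSearchCV:

```python
# Hypothetical sketch of a "choose how many folds go to training" splitter.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class CombinationFold:
    """Yield every way of using n_train_folds of n_splits folds for training."""

    def __init__(self, n_splits=5, n_train_folds=3, random_state=0):
        self.n_splits = n_splits
        self.n_train_folds = n_train_folds
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        # Shuffle the sample indices, cut them into n_splits folds, then
        # emit one (train, test) pair per combination of training folds.
        rng = np.random.RandomState(self.random_state)
        folds = np.array_split(rng.permutation(len(X)), self.n_splits)
        for train_folds in combinations(range(self.n_splits), self.n_train_folds):
            train = np.concatenate([folds[i] for i in train_folds])
            test = np.concatenate(
                [folds[i] for i in range(self.n_splits) if i not in train_folds]
            )
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(list(combinations(range(self.n_splits), self.n_train_folds)))


X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [1, 10]}, cv=CombinationFold(5, 3))
search.fit(X, y)
```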

So… that’s all, folks! As I said, I have this implemented and would be happy to contribute a PR, but I also wanted to write this up before starting that (hopefully not too) long process, to make sure these features would be appreciated, that they are not already in current dev, etc. I didn’t find anything related to this in the issues, but then again I didn’t check ALL the issues…

So please, let me know your feedback. Thanks!

Versions

System:
    python: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
    executable: /home/marc/anaconda3/envs/ML/bin/python
    machine: Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/marc/anaconda3/envs/ML/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.5.0
    sklearn: 0.20.0
    numpy: 1.12.1
    scipy: 1.1.0
    Cython: 0.29
    pandas: 0.23.4

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
cmarmo commented, Feb 6, 2022

Could this issue be closed as solved in #12613? If I understand correctly, all items in the description are either solved or rejected.

1 reaction
ogrisel commented, Nov 16, 2018

I don’t think this is an edge case. The key point here is that when you use conventional K-fold CV, i.e. training with K-1 folds, the objective is to get an estimate of how your model would perform if you trained with all the data you have now. However, there are real use cases (I actually use this at work) where you know that you won’t have that much data, so you constrain your training data on purpose to be less than the maximum. In addition to that, having more testing data also reduces the variance of your accuracy estimate.

You can use ShuffleSplit(train_size=0.2, test_size=0.2) to simulate arbitrarily small training and test sets, without any class label stratification. If you have a more complex, exotic use case than this, then implementing your own CV strategy is actually the best thing to do.

Note that you can also pass absolute numbers of samples (integers) instead of relative fractions (floats): ShuffleSplit(train_size=200, test_size=200).
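
For illustration, a small sketch of the suggestion above, passing a ShuffleSplit instance directly as the cv of GridSearchCV (the dataset, estimator and sizes are arbitrary placeholders):

```python
# Sketch: constraining the training-set size during a grid search with ShuffleSplit.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Fractions of the data ...
cv = ShuffleSplit(n_splits=10, train_size=0.2, test_size=0.2, random_state=0)
# ... or absolute sample counts:
# cv = ShuffleSplit(n_splits=10, train_size=30, test_size=30, random_state=0)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
```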
