Improve GridSearchCV and splitter classes
Edit: I have added to each bullet point the conclusions so far.
Description
This is my first feature request, so I apologise in advance if I don't follow the correct protocol. I have a modified version (not ready to be a PR, but happy to contribute if agreed) of various classes in the `model_selection` package, addressing a number of features I wanted in `GridSearchCV`, which derives from `BaseSearchCV`. All of my changes apply to `BaseSearchCV`, so they could also apply to `RandomizedSearchCV`.
- (will be solved in 0.21) The parallelisation is carried out with joblib's parallel methods. This entails some problems that I have experienced and that have already been reported with no clear solution; see e.g. https://github.com/joblib/joblib/issues/125 or https://github.com/joblib/joblib/issues/480. My solution has been to use `concurrent.futures.ProcessPoolExecutor` instead (a rough sketch of that approach is shown after this list).
- [solved in PR https://github.com/scikit-learn/scikit-learn/pull/12613] The current `_fit_and_score` method used in `BaseSearchCV` doesn't print the train score, i.e. the performance on the training set. I was interested in that, so I added that print for higher verbosity when `verbose > 3`.
- [cannot go in until Python 3.6 is the minimum version; the next release will still support 3.5] The current `_fit_and_score` method, like most methods in sklearn, doesn't use f-strings. I have changed that as well, but maybe that's something you want to keep as-is for backwards compatibility.
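A rough, hypothetical sketch of the `ProcessPoolExecutor` idea mentioned in the first bullet (my actual patch is not shown here; the estimator, dataset and candidate parameters below are placeholders, not scikit-learn internals):

```python
# Hypothetical sketch: parallelising a per-candidate fit/score loop with
# concurrent.futures instead of joblib.
from concurrent.futures import ProcessPoolExecutor

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
candidate_params = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]
cv = KFold(n_splits=5)


def fit_and_score(params, train, test):
    """Fit one candidate on one split and return its test score."""
    est = clone(SVC()).set_params(**params)
    est.fit(X[train], y[train])
    return est.score(X[test], y[test])


if __name__ == "__main__":
    # One task per (candidate, split) pair, exactly like the search loop.
    tasks = [(p, tr, te) for p in candidate_params for tr, te in cv.split(X, y)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(fit_and_score, *zip(*tasks)))
    print(scores)
```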
In addition to that, I wanted some flexibility in how K-fold CV is done. The `cv` parameter in `BaseSearchCV` creates a `KFold` (or `StratifiedKFold` for classifiers) object when you pass an integer. Alternatively, you can pass a specific splitter instance such as `KFold`, `LeaveOneOut`, etc.
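For reference, both ways of passing `cv` use the standard public API (the estimator and data here are just a toy example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0]}

# Integer: expanded internally to (Stratified)KFold with that many folds.
search_int = GridSearchCV(SVC(), param_grid, cv=5)

# Splitter instance: anything exposing split() and get_n_splits().
search_loo = GridSearchCV(SVC(), param_grid, cv=LeaveOneOut())

search_int.fit(X, y)
print(search_int.best_params_)
```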
However, there are some issues:
- (rejected; this can easily be done by creating your own splitter, and it's better not to add many more classes) When you split into K folds, the train/test splits are always K-1 folds for training and 1 for testing. I wanted something more flexible, so I created my own `MarcFold` (it's just 40 lines), where I can specify how many folds go to training and how many to testing; all possible (train, test) combinations are then created. To the best of my knowledge, there is no existing implementation of that (a sketch of the idea is shown after this list).
- (rejected; after analysing it better, it looks like the rest of the splitters don't need a public `_RepeatedSplits`, so anyone wanting to derive from it has to do it hackily based on the private class) Related to this, I had to create my own `RepeatedMarcFold` because there is no way to pass `MarcFold` to anything unless you call the private class `_RepeatedSplits`. That could be corrected by making `RepeatedSplits` public, or by adding a less shy sibling class.
- (rejected; this can be done with `ShuffleSplit`) Also, I wanted a way to specify the number of samples used for training, rather than the number of folds, as explained in the first bullet of this second list 😃 That parameter should always be lower than or equal to the number of samples available for training (given the first parameter and the data you have); otherwise, all of them are taken and a warning is shown.
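A hedged sketch of what such a `MarcFold`-style splitter could look like (a reconstruction for illustration only; the class name `CombinationKFold` is hypothetical and my actual 40-line implementation is not shown here):

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import BaseCrossValidator


class CombinationKFold(BaseCrossValidator):  # hypothetical name
    """Cut the samples into `n_folds` contiguous folds and yield every
    combination of `n_test_folds` folds as a test set; the remaining
    folds form the training set."""

    def __init__(self, n_folds=5, n_test_folds=1):
        self.n_folds = n_folds
        self.n_test_folds = n_test_folds

    def get_n_splits(self, X=None, y=None, groups=None):
        # Number of ways to choose the test folds.
        return len(list(combinations(range(self.n_folds), self.n_test_folds)))

    def _iter_test_indices(self, X, y=None, groups=None):
        # BaseCrossValidator.split() builds each training set as the
        # complement of the test indices yielded here.
        folds = np.array_split(np.arange(len(X)), self.n_folds)
        for test_folds in combinations(range(self.n_folds), self.n_test_folds):
            yield np.concatenate([folds[i] for i in test_folds])


# With n_folds=5 and n_test_folds=2 this yields 10 (train, test) splits,
# each using 3 folds for training and 2 for testing; the instance can be
# passed directly as the `cv` argument of GridSearchCV.
cv = CombinationKFold(n_folds=5, n_test_folds=2)
```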
So… that's all, folks! As I said, I have this implemented and would be happy to contribute a PR, but I wanted to write this first, before starting that (hopefully not too) long process, to be sure that these features would be appreciated, that they are not already in the current dev branch, etc. I didn't find anything related to this in the issues, but then again I didn't check ALL the issues…
So please let me know your feedback. Thanks!
Versions
System:
    python: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
    executable: /home/marc/anaconda3/envs/ML/bin/python
    machine: Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/marc/anaconda3/envs/ML/lib
    cblas_libs: mkl_rt, pthread

Python deps:
    pip: 18.1
    setuptools: 40.5.0
    sklearn: 0.20.0
    numpy: 1.12.1
    scipy: 1.1.0
    Cython: 0.29
    pandas: 0.23.4
Top GitHub Comments
Could this issue be closed as solved by #12613? If I understand correctly, all items in the description have been either solved or rejected.
You can use `ShuffleSplit(train_size=0.2, test_size=0.2)` to simulate arbitrarily small training and test sets, without any class label stratification. If you have a more complex, exotic use case than this, then implementing your own CV strategy is actually the best thing to do. Note that you can also pass absolute integer numbers of samples instead of floating point relative fractions: `ShuffleSplit(train_size=200, test_size=200)`.
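For completeness, a small self-contained example of that suggestion plugged into `GridSearchCV` (the dataset, estimator and parameter grid are placeholders; with integer sizes, `ShuffleSplit` draws that many train and test samples at each iteration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 150 samples

# 10 random splits, each with exactly 100 training and 30 test samples.
cv = ShuffleSplit(n_splits=10, train_size=100, test_size=30, random_state=0)

search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```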