CV integration for OOB-scoring
Describe the workflow you want to enable
Out-of-Bag (OOB) scoring provides an estimate of how well a `RandomForest` generalizes without needing to refit the model several times, as k-fold cross validation (CV) demands. Although sklearn provides a mechanism to obtain this estimate, it does not provide a way to integrate it into the existing cross validation workflows. For example, we might have a `GridSearchCV` where we want to optimise hyperparameters for the forest while fitting only once per parameter set. In theory, this could be implemented using OOB error.
As far as I can tell, the two parameters of interest here are `cv` and `scoring`, both inputs to all the CV-related classes, and ultimately to `cross_val_score()`. `scoring` can be implemented easily enough using a custom scorer, since this has access to the final estimator and therefore the OOB error. What is problematic here is the `cv` argument, which requires that we split the dataset, and offers no alternative.
Describe your proposed solution
- We add `sklearn.metrics.oob`, a scoring function that just returns the OOB error on the trained classifier
- We add `sklearn.model_selection.IntegratedCV`, which is a cross validator that does not split the data at all, i.e. `IntegratedCV().split(X)` will return `X` unchanged
With the combination of these two entities, users will be able to perform OOB-based cross-validation.
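To make the proposal concrete, here is a minimal sketch of what the two pieces might look like in user code. `IntegratedCV` and `oob_scorer` are hypothetical names that do not exist in scikit-learn. Because scikit-learn splitters yield `(train, test)` index pairs rather than data, "returning `X` unchanged" is expressed here as yielding every index on both sides. Note also that `oob_score_` is a score (accuracy for classifiers), not an error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


class IntegratedCV:
    """Hypothetical splitter: one 'fold' that trains on every sample."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        # All indices are used for both train and test; the test side is
        # ignored by the OOB scorer but must be present for the CV machinery.
        yield indices, indices


def oob_scorer(estimator, X, y):
    # Hypothetical scorer: ignores X and y and reports the fitted
    # estimator's out-of-bag score (requires oob_score=True).
    return estimator.oob_score_


X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)

# One fit only, scored by the out-of-bag estimate.
scores = cross_val_score(forest, X, y, cv=IntegratedCV(), scoring=oob_scorer)
```

The scorer is only meaningful for estimators that expose `oob_score_`; with any other estimator it would raise an `AttributeError`.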
Describe alternatives you’ve considered, if relevant
It is possible to apply general cross validation strategies, such as k-fold, to a `RandomForest`. This alternative already exists in sklearn today. However, it neglects the significant (`k` times) speedup that could be obtained using OOB error.
Additional context
This question is notably discussed in these threads:
Top GitHub Comments
A quick example of doing this without adding a new splitter or function is:
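One way this could look, as a minimal sketch rather than the commenter's exact code: pass `cv` a single hand-built `(train, test)` pair covering every row, and pass `scoring` a callable that returns `oob_score_`. The name `oob_scorer` and the parameter grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)


def oob_scorer(estimator, X, y):
    # Illustrative helper: the "test" data is ignored, only the
    # out-of-bag score of the fitted forest is reported.
    return estimator.oob_score_


# A single "split" whose train and test sides both contain all rows,
# so each parameter candidate is fitted exactly once during the search.
all_rows = np.arange(len(X))
single_split = [(all_rows, all_rows)]

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    scoring=oob_scorer,
    cv=single_split,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

With the default `refit=True`, the best candidate is fitted once more on the full data after the search, so the total is one fit per candidate plus one.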
There is a maintenance cost to adding a new scorer and a new splitter that should only be used with estimators that have an `oob_score_` attribute. If using the OOB score is a good practice, then I prefer adding an example like the one above to show users how to do it.

That’s why I think this is interesting to show the difference and mention it as a potential gotcha.