CV integration for OOB-scoring

Describe the workflow you want to enable

Out-of-Bag (OOB) scoring provides an estimate of how well a RandomForest generalizes without refitting the model several times, as k-fold cross validation (CV) requires. Although sklearn provides a mechanism to obtain this estimate, it does not provide a mechanism to integrate it into the existing cross validation workflows. For example, we might have a GridSearchCV where we want to optimise hyperparameters for the forest while fitting only once per parameter set. In theory, this could be implemented using OOB error.
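For readers unfamiliar with the existing mechanism, a minimal sketch of reading the OOB estimate from a single fit might look like this. The dataset and hyperparameter values are illustrative assumptions, not part of the original issue.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used purely for illustration.
X, y = make_classification(n_samples=200, random_state=0)

# oob_score=True relies on bootstrap sampling, which is the default (bootstrap=True).
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Each sample is scored only by the trees that did not see it during training.
oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

The estimate comes from one call to fit, which is what makes it attractive for model selection.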

As far as I can tell, the two parameters of interest here are cv and scoring, both inputs to all the CV-related classes, and ultimately to cross_val_score(). scoring can be implemented easily enough using a custom scorer, since this has access to the final estimator and therefore the OOB error. What is problematic here is the cv argument, which requires that we split the dataset, and offers no alternative.

Describe your proposed solution

We add sklearn.metrics.oob, a scoring function that just returns the OOB error on the trained classifier.
We add sklearn.model_selection.IntegratedCV, which is a cross validator that does not split the data at all, i.e. IntegratedCV().split(X) will return X unchanged.

With the combination of these two entities, users will be able to perform OOB-based cross-validation.
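A rough sketch of how these two pieces could fit together is shown below. NoSplitCV and oob_scorer are hypothetical names standing in for the proposed IntegratedCV and sklearn.metrics.oob; neither exists in scikit-learn. Because scikit-learn splitters yield (train, test) index pairs rather than data, the sketch assumes "no split" means all indices for training and an empty test fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class NoSplitCV:
    """Hypothetical stand-in for the proposed IntegratedCV (not in scikit-learn)."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        # Train on every row; the test fold is deliberately empty.
        yield np.arange(n_samples), np.empty(0, dtype=int)


def oob_scorer(estimator, X, y):
    """Hypothetical scorer: ignore X and y, report the fitted estimator's OOB score."""
    return estimator.oob_score_


X, y = make_classification(random_state=42)
search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0),
    {"max_features": ["sqrt", "log2"]},
    cv=NoSplitCV(),
    scoring=oob_scorer,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that oob_score_ is an accuracy-like score (higher is better), so it can be maximised directly; a scorer returning the OOB error would need its sign flipped.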

Describe alternatives you’ve considered, if relevant

It is possible to apply general cross validation strategies, such as k-fold, to a RandomForest. This alternative already exists in sklearn today. However, it forgoes the significant (roughly k times) speedup that could be obtained using OOB error.
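To make the claimed saving concrete, the number of fits in an exhaustive search can be counted as below. The grid size and the helper name n_search_fits are illustrative assumptions.

```python
def n_search_fits(n_candidates: int, n_splits: int, refit: bool = True) -> int:
    """Count estimator fits: one per candidate per split, plus the optional final refit."""
    return n_candidates * n_splits + (1 if refit else 0)


n_candidates = 12  # hypothetical number of parameter combinations
k = 5              # number of folds in the k-fold baseline

kfold_fits = n_search_fits(n_candidates, n_splits=k)
oob_fits = n_search_fits(n_candidates, n_splits=1)
print(kfold_fits, oob_fits)  # 61 13
```

The wall-clock gain is likely somewhat below k, since each k-fold fit trains on only (k-1)/k of the data and computing the OOB score adds its own cost.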

Additional context

This question is notably discussed in these threads:

Top GitHub Comments

thomasjpfan commented, Jun 5, 2022

A quick example of doing this without adding a new splitter or function is:

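from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(random_state=42)
cv = GridSearchCV(
    # oob_score=True makes each fitted forest expose oob_score_.
    RandomForestClassifier(oob_score=True, random_state=0),
    {"max_features": ["sqrt", "log2"]},
    # A single "split": train on every row, with an empty test set.
    cv=[(np.arange(X.shape[0]), np.empty(0, dtype=int))],
    # Ignore the (empty) test data and report the OOB score instead.
    scoring=lambda est, X, y: est.oob_score_,
)
# fit returns the fitted GridSearchCV itself; scores are in results.cv_results_.
results = cv.fit(X, y)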

There is a maintenance cost to adding a new scorer and a new splitter that should only be used with estimators that have an oob_score_. If using the OOB score is a good practice, then I prefer adding an example like the one above to show users how to do it.

glemaitre commented, Jun 10, 2022

“I’m not sure that really matters because you wouldn’t use OOB for the final test set evaluation, only for model selection.”

That’s why I think this is interesting to show the difference and mention it as a potential gotcha.
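One possible reading of this gotcha is that the OOB score used to pick hyperparameters should not double as the final performance figure. A sketch under that assumption, with an illustrative dataset and split sizes, compares the selection-time OOB estimate with accuracy on a held-out test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(oob_score=True, random_state=0),
    {"max_features": ["sqrt", "log2"]},
    # Single split: all training rows, empty test fold.
    cv=[(np.arange(X_train.shape[0]), np.empty(0, dtype=int))],
    scoring=lambda est, X, y: est.oob_score_,
)
search.fit(X_train, y_train)

# OOB estimate that drove model selection.
oob_estimate = search.best_score_
# Score the refit estimator directly: search.score() would call the custom
# scorer and return the OOB score again instead of test accuracy.
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(f"OOB: {oob_estimate:.3f}  held-out: {test_accuracy:.3f}")
```

The two numbers will generally differ, and only the held-out figure is untouched by the selection step.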
