CV integration for OOB-scoring

Describe the workflow you want to enable

Out-of-Bag (OOB) scoring provides an estimate of the model generalizability for RandomForest without needing to refit the model several times as is demanded by k-fold cross validation (CV). Although sklearn provides a mechanism to obtain this estimate, it does not provide a mechanism to integrate it into the existing cross validation workflows. For example, we might have a GridSearchCV where we want to optimise hyperparameters for the forest, but only fitting once per parameter set. This in theory could be implemented using OOB error.
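
To make the contrast concrete, here is a minimal sketch comparing the OOB estimate from a single fit with a 5-fold cross-validation estimate that refits the forest once per fold. The synthetic dataset, the number of trees, and the variable names are assumptions chosen for illustration; they are not part of the original issue.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=300, random_state=0)

# One fit: each sample is scored only by the trees whose bootstrap
# draw did not include it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
oob_accuracy = forest.oob_score_

# Five fits: the forest is refit once per fold.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)
cv_accuracy = cv_scores.mean()

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

The two numbers estimate the same quantity and are usually close, but they are not guaranteed to agree.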

As far as I can tell, the two parameters of interest here are cv and scoring, both of which are inputs to all the CV-related classes and, ultimately, to cross_val_score(). scoring can be implemented easily enough using a custom scorer, since a scorer has access to the fitted estimator and therefore to the OOB error. What is problematic here is the cv argument, which requires that we split the dataset and offers no alternative.
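
The following sketch illustrates both halves of that observation. The scorer name oob_scorer is hypothetical; it relies only on the documented scorer signature (estimator, X, y) and on the oob_score_ attribute. Passing it to cross_val_score() works, but an integer cv still splits the data, so each forest is trained on only part of the samples and the held-out fold is never used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def oob_scorer(estimator, X, y):
    """Hypothetical scorer: ignore X and y and return the stored OOB score."""
    return estimator.oob_score_


# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=300, random_state=0)

# cv=3 still splits the data: each forest is fit on two thirds of the
# samples, and the held-out third is passed to the scorer but ignored.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0),
    X,
    y,
    cv=3,
    scoring=oob_scorer,
)
print(scores)
```

The result is three OOB scores from three forests that each saw less data than necessary, which is the wasted work the issue wants to avoid.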

Describe your proposed solution

  • We add sklearn.metrics.oob, a scoring function that simply returns the OOB error of the trained classifier
  • We add sklearn.model_selection.IntegratedCV, a cross-validator that does not split the data at all, i.e. IntegratedCV().split(X) will return X unchanged

With the combination of these two entities, users will be able to perform OOB-based cross-validation.
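
A rough sketch of how the two proposed pieces could look in user code is shown below. Neither sklearn.metrics.oob nor IntegratedCV exists in scikit-learn; NoSplitCV and oob_score_scorer are hypothetical stand-ins written for this illustration. The sketch also assumes that the splitter yields a (train_indices, test_indices) pair with an empty test set, since that is the form GridSearchCV consumes, rather than returning X itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


class NoSplitCV:
    """Hypothetical stand-in for the proposed IntegratedCV splitter."""

    def split(self, X, y=None, groups=None):
        # Yield one "split": every sample for training, none for testing.
        n_samples = len(X)
        yield np.arange(n_samples), np.empty(0, dtype=int)

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1


def oob_score_scorer(estimator, X, y):
    """Hypothetical stand-in for the proposed OOB scorer."""
    return estimator.oob_score_


# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0),
    {"max_features": ["sqrt", "log2"]},
    cv=NoSplitCV(),
    scoring=oob_score_scorer,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With this setup each candidate is fit once during the search, and best_score_ is the OOB score of the winning candidate.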

Describe alternatives you’ve considered, if relevant

It is possible to apply general cross-validation schemes, such as k-fold, to a RandomForest. This alternative already exists in sklearn today. However, it forgoes the significant (roughly k times) speedup that could be obtained by using the OOB error instead.
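
The size of that speedup can be estimated by counting fits. The sketch below uses an assumed parameter grid and k = 5; the grid values are illustrative only. It counts the search fits alone and leaves out the final refit on the best parameters, which both approaches perform.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 2 x 3 = 6 candidate parameter sets.
param_grid = {
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 5, 10],
}
n_candidates = len(ParameterGrid(param_grid))

k = 5
fits_with_kfold = n_candidates * k  # one fit per candidate per fold
fits_with_oob = n_candidates        # one fit per candidate

print(f"k-fold search: {fits_with_kfold} fits")  # 30 fits
print(f"OOB search: {fits_with_oob} fits")       # 6 fits
```

The wall-clock gain is likely somewhat smaller than k, because each k-fold fit sees only (k-1)/k of the data and computing the OOB score adds some work to every fit.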

Additional context

This question is notably discussed in these threads:

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

thomasjpfan commented, Jun 5, 2022

A quick example of doing this without adding a new splitter or function is:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(random_state=42)
cv = GridSearchCV(
    # oob_score=True makes each fitted forest expose oob_score_
    RandomForestClassifier(oob_score=True, random_state=0),
    {"max_features": ["sqrt", "log2"]},
    # One "split": train on all samples, with an empty test set
    cv=[(np.arange(X.shape[0]), np.empty(0, dtype=int))],
    # Ignore the (empty) test data and report the stored OOB score
    scoring=lambda est, X, y: est.oob_score_,
)
results = cv.fit(X, y)

There is a maintenance cost to adding a new scorer and a new splitter that should only be used with estimators that have an oob_score_. If using the OOB score is good practice, then I would prefer adding an example like the one above to show users how to do it.

glemaitre commented, Jun 10, 2022

I’m not sure that really matters because you wouldn’t use OOB for the final test set evaluation, only for model selection

That’s why I think it would be interesting to show the difference and mention it as a potential gotcha.
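
One way such a comparison could be shown is sketched below: the OOB score of a forest is printed next to its accuracy on a held-out test set that played no part in training. The dataset, split size, and number of trees are assumptions made for this illustration, not values from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# OOB estimate computed from the training data only.
oob_accuracy = forest.oob_score_
# Accuracy on data the forest never saw during fitting.
test_accuracy = forest.score(X_test, y_test)

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Any gap between the two numbers is the difference the comment refers to: the OOB score is suited to model selection, while the held-out score remains the final evaluation.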
