
Store the OOB Loss for `GradientBoostingClassifier`


Describe the workflow you want to enable

Currently the only OOB-related performance metric we store on GradientBoostingClassifier is oob_improvement_, which is an array of OOB loss decreases per iteration. However, it would also be useful to track the actual OOB loss value at each iteration. This can serve as an estimate of the generalization error and might bypass the need for cross-validation in some cases. It would also help the estimator integrate into the framework proposed in #23391.
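
For illustration, here is a minimal sketch of what is available today (the dataset and hyperparameters are placeholders): only the per-iteration OOB improvement is stored, while the absolute OOB loss is computed internally and then discarded.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# oob_improvement_ is only populated when subsample < 1.0
clf = GradientBoostingClassifier(n_estimators=100, subsample=0.5, random_state=0)
clf.fit(X, y)

print(clf.oob_improvement_.shape)  # (100,): OOB loss decrease per iteration
# The absolute OOB loss at each iteration is computed during fitting but never stored.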

Describe your proposed solution

I propose adding a new attribute, oob_score_ (or alternatively oob_loss_), to GradientBoostingClassifier. It would only be set when subsample < 1 and would be updated at each iteration. We already compute this value during fitting; we just throw it away: https://github.com/scikit-learn/scikit-learn/blob/32f9deaaf27c7ae56898222be9d820ba0fd1054f/sklearn/ensemble/_gb.py#L758-L768
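
To make the intent concrete, here is a rough sketch of the quantity being proposed. Since no such attribute exists yet, the snippet approximates the per-iteration loss with a held-out split and staged_predict_proba rather than the true out-of-bag samples; the proposed attribute would provide this history directly.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, subsample=0.5, random_state=0)
clf.fit(X_train, y_train)

# One loss value per boosting iteration, analogous to the proposed oob_scores_ / oob_loss_.
val_losses = np.array(
    [log_loss(y_val, proba) for proba in clf.staged_predict_proba(X_val)]
)
print(val_losses.shape)  # (100,)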

Describe alternatives you’ve considered, if relevant

You might think that the cumulative sum of oob_improvement_ would give us the loss values, and that is almost true: it also requires the OOB loss of the first iteration, which isn't stored anywhere. So this alone doesn't solve the issue.
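
A toy example with made-up numbers makes the gap explicit: the cumulative sum only recovers losses relative to a starting value that scikit-learn never exposes.

import numpy as np

oob_loss = np.array([0.65, 0.60, 0.52, 0.47, 0.45])  # what we would like to store (hypothetical values)
oob_improvement = -np.diff(oob_loss)                  # the per-iteration decreases we currently keep

# cumsum only yields the loss relative to the first value, which is not stored:
recovered = oob_loss[0] - np.cumsum(oob_improvement)
print(np.allclose(recovered, oob_loss[1:]))  # True, but only because we knew oob_loss[0]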

Additional context

No response

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Jun 2, 2022

I am fine with storing:

oob_scores_  # with the full history
oob_score_  # as a convenience for the last element of the previous array.
0 reactions
awinml commented, Nov 3, 2022

/take


