
OOB calculation in Random Forests seems to use incorrect indices


update 9 March 2021

After the discussion below and further testing, I came to the conclusion that this is not a bug after all. The indices are computed correctly, because _make_estimator always generates a new random seed that is stored in the random_state variable of each tree – even if a np.random.RandomState() object was passed to _make_estimator.

Apologies for the false alarm. I got confused by the interchangeable use of random_state as a seed or a random generator (but, interestingly, not as the actual state of the random generator).
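The mechanism can be illustrated with a minimal sketch (names are simplified stand-ins, not the actual scikit-learn helpers): the forest draws a fresh integer seed from its shared generator for each tree, and that stored integer makes index regeneration deterministic.

```python
import numpy as np

MAX_INT = np.iinfo(np.int32).max

def make_tree_seeds(forest_seed, n_trees):
    # Each tree receives a fresh *integer* seed drawn from the shared
    # generator, mirroring what _make_estimator arranges per tree.
    rng = np.random.RandomState(forest_seed)
    return [rng.randint(MAX_INT) for _ in range(n_trees)]

def sample_indices(tree_seed, n_samples):
    # Because the stored seed is a plain int, re-seeding reproduces the
    # exact bootstrap indices that were drawn during fitting.
    rng = np.random.RandomState(tree_seed)
    return rng.randint(0, n_samples, n_samples)

seeds = make_tree_seeds(0, 3)
first = sample_indices(seeds[0], 10)
again = sample_indices(seeds[0], 10)
assert np.array_equal(first, again)  # regeneration is deterministic
```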

original report

I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be “out-of-bag” are not actually the same as those held out during training. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 was merged, which replaced the explicit storage of bootstrap indices with a procedure that regenerates them on demand.

Specifically, this resulted in the following code fragment in _compute_oob_predictions(self, X, y) in sklearn/ensemble/_forest.py:

for estimator in self.estimators_:
    unsampled_indices = _generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap,
    )

    y_pred = self._get_oob_predictions(
        estimator, X[unsampled_indices, :]
    )
    oob_pred[unsampled_indices, ...] += y_pred
    n_oob_pred[unsampled_indices, :] += 1

If estimator.random_state is a np.random.RandomState object instead of a constant seed, then this appears to generate new random indices rather than the original set. I noticed that this potential shortcoming was also raised by @amueller in the discussion on https://github.com/scikit-learn/scikit-learn/issues/4783, but it was never followed up there:

Silly question: This fails if a np.RandomState was passed to the forest, right? Because then the random_state would have been altered by building the tree? (or I’m tired)

_Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188_

I think this is a more general concern. Even if a constant seed was passed to the RandomForest initializer, BaseForest.fit(…) executes random_state = check_random_state(self.random_state), which turns random_state into a RandomState() object. A few lines below, trees = [self._make_estimator(append=False, random_state=random_state) for i in range(n_more_estimators)] then appears to make all trees share the same RandomState() instance, which is used to generate the bootstrap indices. Calling _generate_sample_indices(...) again, as part of _generate_unsampled_indices(...), would therefore seem to always generate different indices.
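The worry can be made concrete with a small illustration (hypothetical, not the actual scikit-learn code path): if a single mutable RandomState instance were consumed both at fit time and again at OOB time, the second draw would not reproduce the bootstrap indices.

```python
import numpy as np

# If every tree shared one mutable RandomState *instance* (rather than a
# stored integer seed), a second draw would advance the generator state
# and yield different indices than the ones used at fit time.
shared_rng = np.random.RandomState(42)

fit_indices = shared_rng.randint(0, 10, 10)   # indices drawn at fit time
regenerated = shared_rng.randint(0, 10, 10)   # later "regeneration"

# The two draws disagree, so the complement computed from the second
# draw would not be the true out-of-bag set.
assert not np.array_equal(fit_indices, regenerated)
```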

Am I missing something, or can this procedure never work correctly?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 17 (6 by maintainers)

Top GitHub Comments

1 reaction
simontindemans commented, Mar 9, 2021

@NicolasHug, I can confirm that the original code does work correctly, albeit in a slightly obfuscated manner. See also my update in the top post. You can remove the bug label from this issue, although the proposed modifications (and the discussion in #19648) may still be useful for improving the maintainability of the code.

0 reactions
adrinjalali commented, Aug 22, 2021

Removing the milestone, pinging @NicolasHug since you tagged this for 1.0
