OOB calculation in Random Forests seems to use incorrect indices
See original GitHub issue

Update, 9 March 2021
After the discussion below and further testing, I came to the conclusion that this is not a bug after all. The indices are computed correctly, because `_make_estimator` always generates a new random seed that is stored in the `random_state` attribute of each tree, even if a `np.random.RandomState()` object was passed to `_make_estimator`.
Apologies for the false alarm. I got confused by the interchangeable use of `random_state` as a seed or as a random generator (but, interestingly, not as the actual state of the random generator).
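The resolution can be illustrated with a small sketch (illustrative only, not the actual scikit-learn internals): because each tree gets its own freshly drawn integer seed stored on it, that integer can be replayed at any time to regenerate the same bootstrap indices, even though the forest-level generator is a mutable object.

```python
import numpy as np
from sklearn.utils import check_random_state

# The forest-level generator is a mutable RandomState instance.
rng = check_random_state(42)

# A fresh integer seed is drawn for each tree and stored on the tree,
# mirroring (in spirit) what _make_estimator does for sub-estimators.
tree_seed = rng.randint(np.iinfo(np.int32).max)

# Replaying the stored integer seed reproduces the bootstrap indices exactly.
n_samples = 10
first = check_random_state(tree_seed).randint(0, n_samples, n_samples)
again = check_random_state(tree_seed).randint(0, n_samples, n_samples)
assert np.array_equal(first, again)  # regeneration is reproducible
```

This is why passing a `RandomState` object to the forest does not break OOB regeneration: the mutable object is only used once per tree, to draw the integer seed that is then frozen on the tree.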
original report
I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be “out-of-bag” are not actually the same samples that were held out during training. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 replaced the explicit storage of indices with a regeneration procedure.
Specifically, this resulted in the following code fragment in `_compute_oob_predictions(self, X, y)` in `sklearn/ensemble/_forest.py`:
```python
for estimator in self.estimators_:
    unsampled_indices = _generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap,
    )
    y_pred = self._get_oob_predictions(
        estimator, X[unsampled_indices, :]
    )
    oob_pred[unsampled_indices, ...] += y_pred
    n_oob_pred[unsampled_indices, :] += 1
```
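For context, the regeneration step works by redrawing the bootstrap sample from the per-tree `random_state` and taking its complement. A simplified sketch of that idea (the function name and body here are illustrative, not the exact private scikit-learn code):

```python
import numpy as np
from sklearn.utils import check_random_state

def unsampled_indices(seed, n_samples, n_samples_bootstrap):
    """Recompute the out-of-bag indices for one tree from its stored seed.

    Simplified version of the idea behind sklearn's private
    _generate_unsampled_indices: redraw the bootstrap sample,
    then return every index that was never drawn.
    """
    rng = check_random_state(seed)
    sample_indices = rng.randint(0, n_samples, n_samples_bootstrap)
    counts = np.bincount(sample_indices, minlength=n_samples)
    return np.flatnonzero(counts == 0)

oob = unsampled_indices(seed=0, n_samples=8, n_samples_bootstrap=8)
# The same seed always yields the same OOB set, so repeated calls agree.
assert np.array_equal(oob, unsampled_indices(0, 8, 8))
```

The whole scheme therefore hinges on `estimator.random_state` holding something that replays identically, which is exactly what the report below questions.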
If `estimator.random_state` is a `np.random.RandomState` object instead of a constant seed, then this appears to generate new random indices rather than the original set. I noticed that this potential shortcoming was also observed by @amueller in the discussion about https://github.com/scikit-learn/scikit-learn/issues/4783, but no follow-up was included there:
> Silly question: This fails if a `np.RandomState` was passed to the forest, right? Because then the `random_state` would have been altered by building the tree? (or I’m tired)
>
> _Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188_
I think this is more generally a concern, because even if a constant seed was passed to the RandomForest initializer, `BaseForest.fit(...)` executes

```python
random_state = check_random_state(self.random_state)
```

which makes `random_state` a `RandomState()` object, followed (a few lines below) by:

```python
trees = [
    self._make_estimator(append=False, random_state=random_state)
    for i in range(n_more_estimators)
]
```
All trees appear to share the same `RandomState()` instance, which is then used to generate indices. Calling `_generate_sample_indices(...)` again as part of `_generate_unsampled_indices(...)` would therefore seem to produce different indices every time.
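The concern can be demonstrated in isolation: if a shared, mutable `RandomState` instance were used directly to draw indices, a second draw would advance the generator state and not reproduce the first, whereas a stored integer seed replays exactly.

```python
import numpy as np

n_samples = 10

# A shared, mutable RandomState: the second call advances the generator
# state, so the regenerated indices differ from the original draw.
shared = np.random.RandomState(0)
draw1 = shared.randint(0, n_samples, n_samples)
draw2 = shared.randint(0, n_samples, n_samples)
assert not np.array_equal(draw1, draw2)

# A stored integer seed: regeneration is exact.
seed = 0
a = np.random.RandomState(seed).randint(0, n_samples, n_samples)
b = np.random.RandomState(seed).randint(0, n_samples, n_samples)
assert np.array_equal(a, b)
```

(As the 9 March 2021 update at the top explains, scikit-learn is actually in the second situation: each tree stores its own integer seed, so the regeneration is correct.)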
Am I missing something, or can this procedure never work correctly?
Issue Analytics
- Created: 3 years ago
- Comments: 17 (6 by maintainers)
Top GitHub Comments
@NicolasHug, I can confirm that the original code, although somewhat obfuscated, does work correctly. See also my update in the top post. You can remove the bug label from this issue, although the proposed modifications (and the discussion in #19648) may still be useful to improve the maintainability of the code.
Removing the milestone, pinging @NicolasHug since you tagged this for 1.0