OOB calculation in Random Forests seems to use incorrect indices
See original GitHub issue

Update, 9 March 2021
After the discussion below and further testing, I came to the conclusion that this is not a bug after all. The indices are computed correctly, because `_make_estimator` always generates a new random seed that is stored in the `random_state` attribute of each tree, even if a `np.random.RandomState()` object was passed to `_make_estimator`.
Apologies for the false alarm. I got confused by the interchangeable use of `random_state` as a seed or as a random generator (but, interestingly, not as the actual state of the random generator).
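The resolution can be illustrated with a small sketch (illustrative only, not the actual scikit-learn internals): because each tree gets its own freshly drawn integer seed stored on it, that integer can be replayed at any time to regenerate the same bootstrap indices, even though the forest-level generator is a mutable object.

```python
import numpy as np
from sklearn.utils import check_random_state

# The forest-level generator is a mutable RandomState instance.
rng = check_random_state(42)

# A fresh integer seed is drawn for each tree and stored on the tree,
# mirroring (in spirit) what _make_estimator does for sub-estimators.
tree_seed = rng.randint(np.iinfo(np.int32).max)

# Replaying the stored integer seed reproduces the bootstrap indices exactly.
n_samples = 10
first = check_random_state(tree_seed).randint(0, n_samples, n_samples)
again = check_random_state(tree_seed).randint(0, n_samples, n_samples)
assert np.array_equal(first, again)  # regeneration is reproducible
```

This is why passing a `RandomState` object to the forest does not break OOB regeneration: the mutable object is only used once per tree, to draw the integer seed that is then frozen on the tree.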
original report
I have been looking into the OOB score calculation for random forests, and I am concerned that the indices considered to be “out-of-bag” are not actually the same samples that were held out during training. A few years ago, PR https://github.com/scikit-learn/scikit-learn/issues/4783 replaced the explicit storage of indices with a regeneration procedure.
Specifically, this resulted in the following code fragment in `_compute_oob_predictions(self, X, y)` in `sklearn/ensemble/_forest.py`:
```python
for estimator in self.estimators_:
    unsampled_indices = _generate_unsampled_indices(
        estimator.random_state, n_samples, n_samples_bootstrap,
    )
    y_pred = self._get_oob_predictions(
        estimator, X[unsampled_indices, :]
    )
    oob_pred[unsampled_indices, ...] += y_pred
    n_oob_pred[unsampled_indices, :] += 1
```
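For context, the regeneration step works by redrawing the bootstrap sample from the per-tree `random_state` and taking its complement. A simplified sketch of that idea (the function name and body here are illustrative, not the exact private scikit-learn code):

```python
import numpy as np
from sklearn.utils import check_random_state

def unsampled_indices(seed, n_samples, n_samples_bootstrap):
    """Recompute the out-of-bag indices for one tree from its stored seed.

    Simplified version of the idea behind sklearn's private
    _generate_unsampled_indices: redraw the bootstrap sample,
    then return every index that was never drawn.
    """
    rng = check_random_state(seed)
    sample_indices = rng.randint(0, n_samples, n_samples_bootstrap)
    counts = np.bincount(sample_indices, minlength=n_samples)
    return np.flatnonzero(counts == 0)

oob = unsampled_indices(seed=0, n_samples=8, n_samples_bootstrap=8)
# The same seed always yields the same OOB set, so repeated calls agree.
assert np.array_equal(oob, unsampled_indices(0, 8, 8))
```

The whole scheme therefore hinges on `estimator.random_state` holding something that replays identically, which is exactly what the report below questions.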
If `estimator.random_state` is a `np.random.RandomState` object instead of a constant seed, then this appears to generate new random indices rather than the original set. I noticed that this potential shortcoming was also observed by @amueller in the discussion about https://github.com/scikit-learn/scikit-learn/issues/4783, but no follow-up was included there:
> Silly question: This fails if a `np.RandomState` was passed to the forest, right? Because then the `random_state` would have been altered by building the tree? (or I’m tired)
>
> _Originally posted by @amueller in https://github.com/scikit-learn/scikit-learn/issues/4783#issuecomment-236036188_
I think this is more generally a concern, because even if a constant seed was passed to the RandomForest initializer, `BaseForest.fit(...)` executes

```python
random_state = check_random_state(self.random_state)
```

which makes `random_state` a `RandomState()` object, followed (a few lines below) by:

```python
trees = [
    self._make_estimator(append=False, random_state=random_state)
    for i in range(n_more_estimators)
]
```
All trees appear to share the same `RandomState()` instance, which is then used to generate indices. Calling `_generate_sample_indices(...)` again as part of `_generate_unsampled_indices(...)` would therefore seem to produce different indices every time.
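The concern can be demonstrated in isolation: if a shared, mutable `RandomState` instance were used directly to draw indices, a second draw would advance the generator state and not reproduce the first, whereas a stored integer seed replays exactly.

```python
import numpy as np

n_samples = 10

# A shared, mutable RandomState: the second call advances the generator
# state, so the regenerated indices differ from the original draw.
shared = np.random.RandomState(0)
draw1 = shared.randint(0, n_samples, n_samples)
draw2 = shared.randint(0, n_samples, n_samples)
assert not np.array_equal(draw1, draw2)

# A stored integer seed: regeneration is exact.
seed = 0
a = np.random.RandomState(seed).randint(0, n_samples, n_samples)
b = np.random.RandomState(seed).randint(0, n_samples, n_samples)
assert np.array_equal(a, b)
```

(As the 9 March 2021 update at the top explains, scikit-learn is actually in the second situation: each tree stores its own integer seed, so the regeneration is correct.)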
Am I missing something, or can this procedure never work correctly?
Issue Analytics
- Created: 3 years ago
- Comments: 17 (6 by maintainers)
Top GitHub Comments
@NicolasHug, I can confirm that the original code, although somewhat obfuscated, does work correctly. See also my update in the top post. You can remove the bug label from this issue, although the proposed modifications (and the discussion in #19648) may still be useful to improve the maintainability of the code.
Removing the milestone, pinging @NicolasHug since you tagged this for 1.0