Add random forest row-subsampling without replacement
Describe the workflow you want to enable
I do appreciate the current option of disabling bootstrapping via the Boolean argument bootstrap.
However, currently there is only one alternative: If False, the whole (identical) dataset is used to build each tree.
There is, however, well-known research (starting with the 2007 paper by Strobl et al.) showing that in certain situations subsamples drawn without replacement lead to better performance.
Many well-known random forest implementations (such as ranger) offer this subsampling as an alternative to the bootstrap.
I would greatly appreciate it if sklearn would offer the same.
Describe your proposed solution
For both RandomForestRegressor and RandomForestClassifier allow the user to draw subsamples without replacement for each tree instead of the bootstrap.
The user can choose the fraction of samples drawn for each tree (default: 0.632).
Ideally the functions _generate_unsampled_indices and _generate_sample_indices would still work.
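The difference between the two schemes comes down to how row indices are drawn for each tree. The following is a minimal sketch using NumPy only. The helper names `sample_indices` and `unsampled_indices` are illustrative and are not scikit-learn's private helpers, and the rounding rule for the subsample size is an assumption.

```python
import numpy as np


def sample_indices(n_samples, fraction=0.632, replace=False, seed=None):
    """Draw row indices for one tree (illustrative helper, not sklearn API).

    replace=True  -> bootstrap-style draw, duplicates are possible
    replace=False -> subsample without replacement, all indices distinct
    """
    rng = np.random.default_rng(seed)
    # Assumption: round the requested fraction to the nearest row count.
    n_draw = max(1, int(round(fraction * n_samples)))
    return rng.choice(n_samples, size=n_draw, replace=replace)


def unsampled_indices(n_samples, sampled):
    """Return the rows a tree did not see (its out-of-bag rows)."""
    mask = np.ones(n_samples, dtype=bool)
    mask[sampled] = False
    return np.flatnonzero(mask)


n = 1000
# Proposed scheme: 63.2% of the rows, each used at most once.
sub = sample_indices(n, fraction=0.632, replace=False, seed=0)
# Classic bootstrap: n draws with replacement.
boot = sample_indices(n, fraction=1.0, replace=True, seed=0)

oob_sub = unsampled_indices(n, sub)
oob_boot = unsampled_indices(n, boot)
```

In both cases the out-of-bag rows remain well defined, so out-of-bag estimates could still be computed. For large n, a bootstrap sample of size n contains on average about 1 - 1/e ≈ 0.632 distinct rows, which is presumably why 0.632 is suggested as the default fraction.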
Describe alternatives you’ve considered, if relevant
No response
Additional context
No response
API considerations
Currently, random forests have the option bootstrap=True or False. If this new feature of tree-wise row-subsampling without replacement is added, there are several options:
- Add a new option with_replacement=True (default) or False, that only takes effect if bootstrap=True. Disadvantage: The term bootstrap explicitly means sampling with replacement.
- Add a new option row_subsampling=True (default) or False, which samples with replacement if bootstrap=True and without replacement if bootstrap=False. Disadvantage: It would change current behaviour for bootstrap=False, which currently means no sampling at all.
- Add a new option sampling="bootstrap" (default), and allow a callable / splitter to be passed. Deprecate the option bootstrap (proposed in https://github.com/scikit-learn/scikit-learn/issues/20953#issuecomment-923957749).
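To make the third option more concrete, here is a rough sketch of how a sampling argument that accepts either the string "bootstrap" or a callable could be resolved. Everything here is hypothetical: the `(n_samples, rng)` callable signature and the names `bootstrap_sampler`, `make_subsampler` and `resolve_sampling` are assumptions for illustration and do not exist in scikit-learn.

```python
import numpy as np


def bootstrap_sampler(n_samples, rng):
    """Hypothetical default: n_samples draws with replacement."""
    return rng.integers(0, n_samples, size=n_samples)


def make_subsampler(fraction=0.632):
    """Build a hypothetical callable that subsamples without replacement."""
    def subsampler(n_samples, rng):
        # Assumption: round the requested fraction to the nearest row count.
        n_draw = max(1, int(round(fraction * n_samples)))
        return rng.choice(n_samples, size=n_draw, replace=False)
    return subsampler


def resolve_sampling(sampling="bootstrap"):
    """Map the hypothetical `sampling` argument to an index-drawing callable."""
    if callable(sampling):
        return sampling
    if sampling == "bootstrap":
        return bootstrap_sampler
    raise ValueError(f"Unknown sampling option: {sampling!r}")


rng = np.random.default_rng(0)
draw = resolve_sampling(make_subsampler(0.5))
idx = draw(100, rng)  # 50 distinct row indices for one tree
```

One appeal of this design is that the default keeps today's bootstrap behaviour, while any other row-sampling scheme can be supplied by the user without adding further Boolean flags.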
Issue Analytics
- State:
- Created 2 years ago
- Comments:15 (10 by maintainers)

An alternative API would be to deprecate bootstrap and add a sampling parameter, with "bootstrap" as the default, that accepts a callable or a splitter which would give the subsamples.

I also do not know of references that investigated the impact of the row subsampling scheme w.r.t. predictive performance. However, in light of the growing importance of interpretable machine learning, the sole focus on prediction loss is maybe less justifiable than in the past?