Add random forest row-subsampling without replacement

Describe the workflow you want to enable

I appreciate the current option of disabling bootstrapping via the Boolean argument bootstrap. However, there is currently only one alternative: if bootstrap=False, the whole (identical) dataset is used to build each tree. There is well-known research, though (starting with the paper by Strobl et al. from 2007), showing that in certain situations subsamples drawn without replacement lead to better performance. Many well-known random forest implementations (such as ranger) offer this subsampling as an alternative to the bootstrap.

I would greatly appreciate it if sklearn would offer the same.

Describe your proposed solution

For both RandomForestRegressor and RandomForestClassifier, allow the user to draw subsamples without replacement for each tree instead of using the bootstrap. The user can choose the fraction of samples drawn for each tree (default: 0.632). Ideally, the functions _generate_unsampled_indices and _generate_sampled_indices would still work.
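To make the proposal concrete, here is a minimal sketch of what per-tree row subsampling without replacement could look like. It uses only the Python standard library and is not scikit-learn code: the function name sample_without_replacement and its parameters are hypothetical, and it only loosely mirrors the roles of _generate_sampled_indices and _generate_unsampled_indices. The 0.632 default is taken from the proposal; it is close to 1 - 1/e, the expected fraction of distinct rows in a bootstrap sample.

```python
import random


def sample_without_replacement(n_samples, fraction=0.632, seed=None):
    """Return (in_bag, out_of_bag) row indices for a single tree.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    rng = random.Random(seed)
    # Number of rows each tree sees; at least one row is always drawn.
    n_draw = max(1, round(fraction * n_samples))
    # random.sample draws without replacement, so no index repeats.
    in_bag = sorted(rng.sample(range(n_samples), n_draw))
    in_bag_set = set(in_bag)
    # Rows not drawn remain available for out-of-bag style estimates.
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


if __name__ == "__main__":
    in_bag, out_of_bag = sample_without_replacement(1000, seed=0)
    print(len(in_bag), len(out_of_bag))
```

Because every tree still leaves some rows unused whenever the fraction is below 1, out-of-bag estimates remain possible under this scheme, which is why the sketch returns both index sets.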

Describe alternatives you’ve considered, if relevant

No response

Additional context

No response

API considerations

Currently, random forests have the option bootstrap=True or False. If this new feature of tree-wise row subsampling without replacement is added, there are several options:

Add a new option with_replacement=True (default), or False, that only takes effect if bootstrap=True. Disadvantage: The term bootstrap explicitly means sampling with replacement.
Add a new option row_subsampling=True (default) or False, which samples with replacement if bootstrap=True and without replacement if bootstrap=False. Disadvantage: It would change current behaviour for bootstrap=False, which currently means no sampling at all.
Add a new option sampling="bootstrap" (default), and allow a callable or splitter to be passed. Deprecate the option bootstrap (proposed in https://github.com/scikit-learn/scikit-learn/issues/20953#issuecomment-923957749).
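The third option could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing scikit-learn API: the names resolve_sampling, bootstrap_sampler and make_subsampler are hypothetical, and so is the callable signature (n_samples, rng) returning a list of row indices.

```python
import random


def bootstrap_sampler(n_samples, rng):
    """Draw n_samples row indices with replacement (classic bootstrap)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def make_subsampler(fraction=0.632):
    """Build a sampler that draws a fraction of rows without replacement."""
    def sampler(n_samples, rng):
        n_draw = max(1, round(fraction * n_samples))
        return rng.sample(range(n_samples), n_draw)
    return sampler


def resolve_sampling(sampling="bootstrap"):
    """Map the hypothetical `sampling` argument to a sampler callable."""
    if sampling == "bootstrap":
        return bootstrap_sampler
    if callable(sampling):
        return sampling
    raise ValueError("sampling must be 'bootstrap' or a callable")


if __name__ == "__main__":
    rng = random.Random(0)
    default_rows = resolve_sampling()(100, rng)
    subsampled_rows = resolve_sampling(make_subsampler(0.5))(100, rng)
    print(len(default_rows), len(subsampled_rows))
```

With this shape, the default behaviour stays the bootstrap, while subsampling without replacement becomes one user-supplied strategy among others, which avoids overloading the meaning of bootstrap.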

Issue Analytics

State:
Created 2 years ago
Comments: 15 (10 by maintainers)

Top GitHub Comments

4 reactions

adrinjalali commented, Sep 21, 2021

An alternative API would be to deprecate bootstrap and add a sampling parameter, with "bootstrap" as the default, that also accepts a callable or a splitter which would give the subsamples.

2 reactions

markusloecher commented, Sep 27, 2021

I also do not know of references that have investigated the impact of the row subsampling scheme on predictive performance. However, in light of the growing importance of interpretable machine learning, the sole focus on prediction loss is maybe less justifiable than in the past?
