Add random forest row-subsampling without replacement

Describe the workflow you want to enable

I appreciate the current option of disabling bootstrapping via the Boolean argument bootstrap. However, there is currently only one alternative: if False, the whole (identical) dataset is used to build each tree. There is well-known research, though (starting with the paper by Strobl et al. from 2007), showing that in certain situations subsamples drawn without replacement lead to better performance. Many well-known random forest implementations (such as ranger) offer this subsampling as an alternative to the bootstrap.

I would greatly appreciate it if sklearn would offer the same.

Describe your proposed solution

For both RandomForestRegressor and RandomForestClassifier, allow the user to draw subsamples without replacement for each tree instead of the bootstrap. The user can choose the fraction of samples drawn for each tree (default: 0.632). Ideally, the functions _generate_unsampled_indices and _generate_sampled_indices would still work.
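
To make the proposal concrete, here is a minimal sketch of per-tree row subsampling without replacement, using only the Python standard library. The helper names (subsample_indices, out_of_bag_indices, bootstrap_indices) are hypothetical and are not scikit-learn's private functions; they only illustrate what the sampled and unsampled index sets would look like. The default of 0.632 matches the expected fraction of distinct rows in a bootstrap sample, which approaches 1 - 1/e for large datasets.

```python
import math
import random


def subsample_indices(n_samples, fraction=0.632, seed=None):
    """Draw round(fraction * n_samples) distinct row indices (no replacement).

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    n_draw = max(1, round(fraction * n_samples))
    # random.sample never returns the same element twice.
    return sorted(rng.sample(range(n_samples), n_draw))


def out_of_bag_indices(n_samples, sampled):
    """Rows a tree never saw: the complement of its sampled indices."""
    seen = set(sampled)
    return [i for i in range(n_samples) if i not in seen]


def bootstrap_indices(n_samples, seed=None):
    """Classic bootstrap for comparison: n_samples draws with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


n = 1000
sub = subsample_indices(n, seed=0)
oob = out_of_bag_indices(n, sub)
boot = bootstrap_indices(n, seed=0)

# The subsample has no duplicates, so every drawn row is distinct.
print(len(sub), len(set(sub)), len(oob))
# The bootstrap sees roughly a 1 - 1/e share of distinct rows.
print(len(set(boot)) / n, 1 - math.exp(-1))
```

Because the complement of the sampled rows is still well defined, out-of-bag style estimates remain possible under this scheme, which is why the issue asks that the sampled/unsampled index helpers keep working.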

Describe alternatives you’ve considered, if relevant

No response

Additional context

No response

API considerations

Currently, random forests have the option bootstrap=True or False. If this new feature of tree-wise row-subsampling without replacement is added, there are several options:

  1. Add a new option with_replacement=True (default), or False, that only takes effect if bootstrap=True. Disadvantage: The term bootstrap explicitly means sampling with replacement.
  2. Add a new option row_subsampling=True (default) or False, which samples with replacement if bootstrap=True and without replacement if bootstrap=False. Disadvantage: It would change current behaviour for bootstrap=False, which currently means no sampling at all.
  3. Add new option sampling="bootstrap" (default), and allow callable / splitter to be passed. Deprecate option bootstrap (proposed in https://github.com/scikit-learn/scikit-learn/issues/20953#issuecomment-923957749).
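
As a rough illustration of option 3, the sketch below shows how a sampling argument could accept either the string "bootstrap" or a user-supplied callable. Everything here is an assumption made for illustration: the function names (resolve_sampler, make_subsampler, draw_per_tree) and the callable signature (n_samples, rng) are hypothetical and do not exist in scikit-learn.

```python
import random


def bootstrap_sampler(n_samples, rng):
    """Default behaviour: n_samples draws with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def make_subsampler(fraction=0.632):
    """Build a callable that draws a fraction of rows without replacement."""
    def sampler(n_samples, rng):
        n_draw = max(1, round(fraction * n_samples))
        return rng.sample(range(n_samples), n_draw)
    return sampler


def resolve_sampler(sampling="bootstrap"):
    """Map the hypothetical `sampling` argument to a callable."""
    if sampling == "bootstrap":
        return bootstrap_sampler
    if callable(sampling):
        return sampling
    raise ValueError(f"Unsupported sampling option: {sampling!r}")


def draw_per_tree(n_samples, n_trees, sampling="bootstrap", seed=0):
    """Return one list of row indices per tree."""
    rng = random.Random(seed)
    sampler = resolve_sampler(sampling)
    return [sampler(n_samples, rng) for _ in range(n_trees)]


# Default keeps the bootstrap behaviour.
default_draws = draw_per_tree(100, n_trees=3)
# Passing a callable switches to subsampling without replacement.
custom_draws = draw_per_tree(100, n_trees=3, sampling=make_subsampler(0.5))
print([len(d) for d in default_draws], [len(d) for d in custom_draws])
```

One appeal of this design is that the default stays backward compatible, while the term bootstrap is no longer stretched to cover sampling without replacement.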

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

4 reactions
adrinjalali commented, Sep 21, 2021

An alternative API would be to deprecate bootstrap and add a sampling parameter, with "bootstrap" as the default, that also accepts a callable or a splitter which would give the subsamples.

2 reactions
markusloecher commented, Sep 27, 2021

I also do not know of references that investigated the impact of the row subsampling scheme w.r.t. predictive performance. However, in light of the growing importance of interpretable machine learning, the sole focus on prediction loss is maybe less justifiable than in the past?
