
Feature request: Random Forest node splitting sample size


During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn’s RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:

from sklearn.ensemble import forest
# Note: this patches scikit-learn internals as they existed at the time
# (pre-0.22); the forest module and the _generate_sample_indices signature
# have changed in later releases.

def set_RF_sample_size(n):
    # Each tree draws n bootstrap rows instead of the full training set.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))

def reset_RF_sample_size():
    # Restore the default: each tree draws n_samples bootstrap rows.
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

We both find that splitting using say 20,000 samples rather than the full 400,000 samples gives nearly the same results but with much faster training time. Naturally it requires a bit of experimentation for each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine using a subsample. We then reset the sample size for the final model.

We appreciate your consideration of this feature request!
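For comparison, scikit-learn releases from 0.22 onward expose the per-tree bootstrap size directly through the max_samples parameter, which achieves the same effect as the patch above without touching internals. A minimal sketch, using an illustrative synthetic dataset and subsample sizes:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative synthetic data: 20,000 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=20_000)

# Fast iteration: each tree bootstraps only 2,000 of the 20,000 rows.
fast_rf = RandomForestRegressor(n_estimators=20, max_samples=2_000,
                                random_state=0)
fast_rf.fit(X, y)

# Final model: default behaviour, each tree bootstraps all 20,000 rows.
full_rf = RandomForestRegressor(n_estimators=20, random_state=0)
full_rf.fit(X, y)

# Training-set R^2 of the two forests is typically very close,
# while the subsampled one fits considerably faster.
print(fast_rf.score(X, y), full_rf.score(X, y))
```

max_samples accepts either an int (absolute row count) or a float (fraction of the training set), and only applies when bootstrap=True (the default).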

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 13 (8 by maintainers)

Top GitHub Comments

1 reaction
reshamas commented, Oct 20, 2020

@cmarmo Ok, I labeled it moderate, so it is not picked up by beginners at sprints.

0 reactions
cmarmo commented, Oct 20, 2020

> Do you think this issue is beginner friendly? Is it doable for someone at one of our sprints?

@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. It will probably require some experience to take on.


