Feature request: Random Forest node splitting sample size
During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:
from sklearn.ensemble import forest  # module path used by scikit-learn < 0.22

def set_RF_sample_size(n):
    # Patch the helper that draws each tree's bootstrap sample so it draws only n rows.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n))

def reset_RF_sample_size():
    # Restore the default behaviour: bootstrap samples as large as the training set.
    forest._generate_sample_indices = \
        (lambda rs, n_samples: forest.check_random_state(rs).randint(0, n_samples, n_samples))
We both find that splitting using, say, 20,000 samples rather than the full 400,000 gives nearly the same results with much faster training time. Naturally it takes a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine on a subsample. We then reset the sample size for the final model.
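To make the workflow concrete, here is a rough usage sketch. The synthetic X/y and the specific estimator settings are illustrative only, not from the original issue, and it assumes the two functions above have been defined in the same session:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; in the real use case X, y would be the 400,000-row training set.
X = np.random.rand(400_000, 10)
y = np.random.rand(400_000)

set_RF_sample_size(20000)   # each tree now bootstraps only 20,000 rows
RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)    # fast fits while iterating

reset_RF_sample_size()      # restore full-size bootstrap samples
RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, y)   # final model on all samples

Note that the patch has to be in place before calling fit, since the sampling helper is invoked during tree building.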
We appreciate your consideration of this feature request!
Top GitHub Comments
@cmarmo Ok, I labeled it moderate, so it is not picked up by beginners at sprints.
@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. Perhaps it will require some experience to take over.